Abstract

It is well known that there is room for improvement in the resultant quality of
speech synthesizers in use today. This research focuses on the improvement of speech
synthesis by analyzing various models for speech signals. An improvement in synthesis
quality will benefit any system incorporating speech synthesis. Many synthesizers
in use today use linear predictive coding (LPC) techniques and only use one set of
vocal tract parameters per analysis frame, or per pitch period for pitch-synchronous
synthesizers. This work is motivated by the two-phase analysis-synthesis model proposed
by Krishnamurthy. In lieu of electroglottograph data for vocal tract model transition
point determination, this work estimates this point directly from the speech signal.
The work then evaluates the potential of the two-phase damped-exponential model
for synthetic speech quality improvement. LPC and damped-exponential models
are used for synthesis. Statistical analysis of data collected in a subjective listening
test indicates a statistically significant improvement (at the 0.05 significance level)
in quality using this two-phase damped-exponential model over single-phase LPC,
single-phase damped-exponential, and two-phase LPC for the speakers, sentences,
and model orders used. This subjective test shows the potential for quality improvement
of synthesized speech and supports the need for further research and testing.


A  TWO-PHASE  DAMPED-EXPONENTIAL  MODEL  FOR 

SPEECH  SYNTHESIS 


I. Introduction

This thesis considers the problem of improving the quality of speech generated by
speech synthesizers. Speech synthesis has several military and commercial applications,
including:

•  Speech  output  for  information  systems. 

•  Speech  warnings  for  aircraft  and  other  machinery. 

•  Speech  output  for  automatic  reading  systems  for  the  blind. 

Today, state-of-the-art synthesizers generate highly intelligible speech; however,
modern speech synthesizers still cannot generate natural-sounding speech [23].
This thesis outlines several models for speech signals and documents some experiments
that examine the effects of these models on synthetic speech quality.

1.1  Overview 

Much of the past research on speech synthesis for military applications has
focused on improving the intelligibility, not the quality, of the synthesized speech
[20-22]. However, improving the quality of synthesized speech often also improves its
intelligibility, even though quality itself was not a primary concern of that
research. This thesis investigates the effects various models for speech signals
have on the quality of synthesized speech.

Much of the current research focuses on the analysis of digitized speech waveforms,
extraction of key parameters from the speech, and regeneration of the speech
signal based only on the extracted parameters and the chosen model for speech
synthesis [2,14-16,20,22,23]. This process is called analysis-synthesis. Systems using
analysis-synthesis reconstruct spoken words in the same order as stored speech.
Another method for synthesizing speech is text-to-speech synthesis (TTS). TTS systems
convert textual information into a speech signal based on stored characteristics of
one or more speakers’ voices. There are several stages within the TTS system, of
which the synthesis stage is one. Analysis-synthesis can be used to evaluate the
effects of various signal models on the synthetic speech quality without the need for
the additional stages and without their distorting effects. Despite the differences
between analysis-synthesis and text-to-speech synthesis, improvements made using
analysis-synthesis can be incorporated into the more complex TTS systems.

There are several models for analysis-synthesis of speech waveforms that produce
telephone quality or better speech signals [3,16,23,30]. There are still no
speech synthesis systems capable of generating, at reasonable expense, speech that
humans cannot distinguish from actual speech. The problem involves “inadequate
modeling of human speech production in coarticulation, intonation, and vocal-tract
excitation” [23]. If we model speech production in these areas more accurately, we
should be able to generate synthetic speech of higher quality than that produced by
previous methods.

In  [15],  Krishnamurthy  presented  a  technique  for  analysis-synthesis  using  a 
two-phase  approach.  His  technique  estimates  the  model  parameters  in  two  stages, 
or  phases.  First,  it  estimates  the  model  parameters  over  the  open  glottal  phase 
(when  the  vocal  cords  are  open)  and  then  again  during  the  closed  phase  (when  the 
vocal  cords  are  closed).  The  synthetic  speech  waveform  is  generated  after  estimating 
parameters  for  both  the  excitation  waveform  and  the  vocal  tract  filter  [15].  The 
difference  between  his  proposal  and  systems  already  developed  is  the  use  of  unique 
model parameters for both the open phase and the closed phase for the glottal
excitation as well as the vocal tract filter. Krishnamurthy’s model is the motivating
factor  behind  this  research. 

1.2  Problem 

Most current speech synthesizers assume that speech characteristics are constant
over an entire pitch period. This work investigates the quality of speech synthesized
using a two-phase “Sum-of-Exponentials” model similar to the model proposed by
Krishnamurthy [15]. This model allows the vocal tract model parameters to vary over
the two phases. It is hypothesized that this approach will allow for more
natural-sounding speech synthesis. This work differs from Krishnamurthy’s in that the
phase transition point and glottal closing instants are estimated directly from the
speech signal, without the use of an electroglottograph sampled simultaneously with
the original speech samples. In addition, complete sentences are analyzed and
synthesized, as opposed to the steady-state vowels used by Krishnamurthy [15].

1.3  Definition  of  Terms 

1. Pitch: The fundamental frequency ($f_0$) of a sound or speech signal [12],
corresponding to the repetition rate of puffs of air exiting the vocal cords as they
open and close.

2. Pitch Period: The reciprocal of the pitch ($1/f_0$).

3.  Glottis:  The  area  between  the  vocal  cords. 

4.  Closed  Glottal  Phase:  The  interval  within  a  pitch  period  in  which  the  vocal 
cords  are  together  (i.e.  the  glottis  is  closed). 

5. Open Glottal Phase: The interval within a pitch period in which the vocal
cords  are  separated  (i.e.  the  glottis  is  open). 
6. LPC Synthesizer: A speech synthesis system that generates voiced speech by
filtering a quasi-periodic excitation waveform with a linear, time-varying,
autoregressive (all-pole) filter. The filter poles are determined using linear
predictive coding (LPC) techniques. The system uses a broadband noise source
as the filter input for unvoiced speech.

1.4 Assumptions

In  this  research,  it  is  assumed  that  speech  files  are  phonemically  labeled  as  a 
function  of  sample  index  as  identified  by  a  group  of  experts.  The  TIMIT  speech  data 
base  contains  speech  files  labeled  in  this  manner  and  will  therefore  be  used  [1]. 

1.5  Scope 

Software  development  for  speech  analysis-synthesis  consists  of  one-  and  two- 
phase  damped-exponential  models  and  one-  and  two-phase  LPC  models.  Results 
and  conclusions  are  drawn  from  data  collected  in  subjective  listening  tests  using 
19  subjects.  These  tests  compare  the  speech  synthesis  techniques  using  a  one- 
phase  damped-exponential  model,  a  one-phase  LPC  model,  the  two-phase  damped- 
exponential  model,  and  a  two-phase  LPC  model. 

1.6 Approach/Methodology

First, an analysis-synthesis system is developed using the sum-of-exponential
(also termed “damped-exponential”) model, applied to the entire pitch period.
There is no open- and closed-phase modeling in this first system.
Next, another analysis-synthesis system is developed using a two-phase model similar
to that proposed by Krishnamurthy. Finally, the one- and two-phase systems are adapted
for use with LPC techniques. Subjective listening tests are conducted and statistical
analyses performed to evaluate the effects on quality of the speech signal models
used in the analysis-synthesis systems.
1.7 Outline

An outline of this thesis is as follows. Chapter II provides a brief synopsis of
speech synthesis and key techniques. The chapter includes: a section on speech
synthesizers from early vocoders to state-of-the-art systems; a review of various
models proposed for the glottal source waveform; and the model developed by
Krishnamurthy. Chapter III contains the description of the speech synthesis systems
developed. Chapter IV describes the subjective listening test and statistical analyses
performed, and presents the results obtained from these experiments. Finally,
Chapter V provides a summary of the results and answers to the two fundamental
questions for this thesis:

1. Does the use of a two-phase damped-exponential model in voiced-speech synthesis
show improvement in quality over that obtained with one- and two-phase
LPC or one-phase damped-exponential models?

2.  Does  the  use  of  a  two-phase  model  in  general  show  an  improvement  in  quality 
over  one-phase  models? 

Answering  these  questions  is  the  overall  goal  of  this  work. 
II. Speech Synthesis Background

It  is  well  known  that  the  quality  of  synthetic  speech  is  directly  affected  by  the 
models  chosen  to  generate  it.  Variations  in  the  shape  of  the  pulse  used  to  represent 
the  glottal  excitation  (volume  velocity  of  air  as  it  enters  the  glottis)  from  the  actual 
shape  cause  degradations  in  the  quality  of  the  resulting  speech  waveform  [3,8,9,23]. 
Another  source  of  degradation  in  synthetic  speech  is  the  failure  to  model  the  coupling 
between  the  subglottal  (lungs  and  trachea  up  to  the  glottis)  and  supraglottal  (glottis 
to  lips  and  nostrils)  systems.  Many  models  for  speech  assume  that  the  characteristics 
of  the  vocal  tract  are  fixed  over  a  pitch  period.  However,  the  coupling  between 
the  subglottal  and  supraglottal  systems  causes  a  shift  in  formant  frequencies  from 
closing  to  opening  phases  and  vice  versa  [15].  It  is  expected  that  speech  synthesis 
using  a  model  accounting  for  this  coupling  will  enhance  the  natural  quality  of  speech 
synthesis  [15,23]. 

This  chapter  first  gives  a  brief  history  of  the  development  of  speech  synthesizers. 
It  then  introduces  a  model  proposed  by  Krishnamurthy  [15]  which  does  account  for 
the  coupling  between  the  subglottal  and  supraglottal  systems. 

2.1  Historical  Perspective 

The  scientific  study  of  speech  began  during  the  Renaissance  when  mechanical 
models  were  first  constructed  to  imitate  speech.  Mechanical  speech  synthesis  was 
first well documented in St. Petersburg and Vienna in the late 18th century [6].
However,  the  beginning  of  the  era  of  modern  speech  technology  is  considered  to  be 
the  1930’s.  In  1939,  Dudley  introduced  the  vocoder  (voice  coder)  which  led  to  the 
idea  of  parametric  speech  representation  and  coding.  And  so  began  an  explosion  in 
speech  research  [12]. 

In  1964,  Holmes  presented  a  speech  synthesis  system  that  generated  an  analog 
speech  signal  by  adding  a  small  number  of  sinusoids  together  [11].  This  system  was 
one of the first of a class of synthesizers called formant synthesizers. The system
determines  the  number,  frequency,  amplitude,  and  duration  of  each  sinusoid  using  a 
set  of  rules  for  each  sound  to  be  synthesized. 

Between 1964 and 1973, a class of algorithms called LPC (linear predictive
coding) vocoders was developed. Basic LPC vocoders generate a speech signal by
modeling the glottal excitation as either a quasi-periodic train of impulses for voiced
speech or random noise for unvoiced speech. LPC vocoders filter the excitation
with an all-pole linear filter whose poles correspond to the formant resonances of
the vocal tract. The resultant synthetic waveform is highly intelligible, but sounds
very unnatural.

In  an  effort  to  improve  the  quality  of  speech  synthesized  by  LPC  techniques, 
Rosenberg  investigated  various  models  for  the  glottal  excitation  [27].  He  found  that 
both  trigonometric  and  polynomial  pulse  shapes  improve  speech  quality  relative 
to that obtained with simple impulse excitations. Using Rosenberg’s glottal pulse
shapes, Holmes was able to improve the quality of speech produced by his formant
synthesizer [10].

Several  scientists  have  developed  similar  models  for  glottal  excitation  [8,9,14] 
while  others  developed  models  for  the  derivative  of  the  glottal  pulse  [5,7].  These 
models  laid  the  foundation  for  the  development  of  state-of-the-art  synthesizers  such 
as  Digital  Equipment  Corporation’s  DECtalk.  AT&T  has  also  produced  a  highly- 
intelligible  text-to-speech  synthesizer  based  on  glottal  pulse  excitations  for  voiced 
speech  [30].  Still,  these  state-of-the-art  synthesizers  do  not  produce  natural-sounding 
speech. 

2.2 Krishnamurthy’s Two-Phase Model

In 1992, Krishnamurthy developed an algorithm for jointly determining the
resonances of the vocal tract and the parameters of the glottal source waveform [15].
The algorithm uses the Liljencrants-Fant (LF) glottal-flow derivative model [5] as
the effective source to the vocal tract filter. The algorithm also allows for the coupling
between the subglottal and supraglottal systems during the open phase of a pitch
period by allowing different filter models for each phase. Krishnamurthy’s algorithm
provides the motivation for this research. This section of the review describes the
models and analysis procedure in more detail.

2.2.1 Speech Production Model. Figure 1 illustrates the model of speech
production used in this analysis. The vocal tract filter $V(z)$ is the transfer function
relating the volume velocity at the lips to the glottal volume velocity. The input
to this filter is the effective voice source, $q(n)$, defined as the differentiated glottal
volume velocity. The effective voice source models the effects of the lip radiation
as well as the glottal volume velocity. The output of the filter, $s(n)$, is the radiated
speech pressure wave (sampled at a microphone at time index $n$) [15].


Figure  1  Speech  production  model. 


2.2.1.1 Effective Voice Source Model. The effective voice source over
one pitch period is modeled with a discrete-time version of the model proposed by
Liljencrants and Fant (LF) [5]. Realizing that the radiation through the lips can
be modeled as a first-order filter, they derived a model for the derivative of the
glottal volume velocity [5]. The LF model couples the volume velocity pulse with
the radiation effects of the lips. The continuous-time mathematical form of the model
is

$$
E(t) =
\begin{cases}
E_0 e^{\alpha t} \sin(\omega_g t), & 0 \le t \le T_e \\
\text{(return-phase expression illegible in the source)}, & T_e < t \le T_c
\end{cases}
\qquad (1)
$$

A sample pulse derivative waveform is illustrated in Figure 2.
Figure 2 Liljencrants-Fant glottal flow derivative waveform for parameter set
$E_0 = 2622.799$, $\alpha = 0.032$, $\omega_g$ = 0m5 [sic], $T_e = 32$, $\epsilon = 2$.

Krishnamurthy’s discrete version of the LF model is defined as

$$
q(n) =
\begin{cases}
A_{go} e^{\alpha_{go} n} \sin(\omega_{go} n + \phi_{go}), & n = 0, \ldots, N-1 \\
-A_{gc} e^{-\alpha_{gc}(n-N)}, & n = N, \ldots, M-1
\end{cases}
\qquad (2)
$$

where the subscripts $go$ and $gc$ identify the distinction in parameters between the
open and closed phases respectively. The intervals $[0, N-1]$ and $[N, M-1]$ represent
the open and closed phases. The phase angle, $\phi_{go}$, accounts for discrepancies between
the actual glottal opening instant and the estimated opening instant. Rewritten
using complex exponentials, (2) becomes


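$$
q(n) =
\begin{cases}
C_o z_{go}^{n} + C_o^{*} (z_{go}^{*})^{n}, & n = 0, \ldots, N-1 \\
C_{gc} z_{gc}^{\,n-N}, & n = N, \ldots, M-1
\end{cases}
\qquad (3)
$$

where $*$ denotes complex conjugation and

$$
C_o = \frac{A_{go}}{2j} e^{j\phi_{go}}, \quad
z_{go} = e^{\alpha_{go} + j\omega_{go}}, \quad
C_{gc} = -A_{gc}, \quad
z_{gc} = e^{-\alpha_{gc}}
\qquad (4)
$$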

2.2.1.2 Vocal Tract Model. Krishnamurthy models the vocal tract
filter as a pole-zero system as opposed to the all-pole model used in LPC synthesis.
Furthermore, the model allows the transfer function to change from closed phase to
open phase within a pitch period. The filter transfer function over one entire pitch
period is defined as

$$
V(z) =
\begin{cases}
V_c(z) = \dfrac{B_c(z)}{A_c(z)}, & \text{during closed phase} \\
V_o(z) = \dfrac{B_o(z)}{A_o(z)}, & \text{during open phase}
\end{cases}
\qquad (5)
$$

where the denominator polynomials in each phase are defined as

$$
A_c(z) = \prod_{i=1}^{K_c} \left[1 - z_c(i) z^{-1}\right]\left[1 - z_c^{*}(i) z^{-1}\right] \qquad (6)
$$

$$
A_o(z) = \prod_{i=1}^{K_o} \left[1 - z_o(i) z^{-1}\right]\left[1 - z_o^{*}(i) z^{-1}\right] \qquad (7)
$$

The parameters $K_c$ and $K_o$ represent the number of complex conjugate pairs of poles
estimated in the closed and open phases respectively. These pairs of poles represent
the formant resonances of the vocal tract and can be used in a formant synthesizer.

The numerators ($B_c(z)$, $B_o(z)$), or “zeros”, of (5) contribute to the complex
amplitudes of the poles.
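To make the structure of the denominator polynomials concrete, the sketch below expands a list of complex poles, each paired with its conjugate, into real coefficients in powers of $z^{-1}$. It is an illustration only; the function names and the example pole are hypothetical.

```python
import cmath


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest power first)."""
    out = [0j] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def denominator_from_poles(poles):
    """Coefficients of prod_i [1 - z_i z^-1][1 - z_i^* z^-1] in powers of z^-1."""
    coeffs = [1 + 0j]
    for z in poles:
        coeffs = poly_mul(coeffs, [1, -z])
        coeffs = poly_mul(coeffs, [1, -z.conjugate()])
    # Conjugate pairing makes every coefficient real.
    return [c.real for c in coeffs]


# One hypothetical formant pole at radius 0.95 and angle 0.3 rad/sample.
pole = cmath.rect(0.95, 0.3)
A_c = denominator_from_poles([pole])
```

For a single pair the result is $1 - 2\,\mathrm{Re}(z)\,z^{-1} + |z|^2 z^{-2}$, the familiar second-order resonator denominator.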

2.2.1.3 Speech Signal Model. Assuming that the effective voice source
and vocal tract filters are modeled as described in the previous two sections,
Krishnamurthy shows that the resulting speech signal, the output of the vocal tract filter,
is the sum of exponential signals:

$$
s(n) =
\begin{cases}
B_{go} z_{go}^{n} + B_{go}^{*} (z_{go}^{*})^{n}
  + \sum_{i=1}^{K_o} \left[ B_o(i) (z_o(i))^{n} + B_o^{*}(i) (z_o^{*}(i))^{n} \right],
  & n = 0, \ldots, N-1 \\
B_{gc} z_{gc}^{\,n-N}
  + \sum_{i=1}^{K_c} \left[ B_c(i) (z_c(i))^{n-N} + B_c^{*}(i) (z_c^{*}(i))^{n-N} \right],
  & n = N, \ldots, M-1
\end{cases}
\qquad (8)
$$

The $B_o(i)$ and $B_c(i)$ terms are complex amplitudes and $B_{go}$ and $B_{gc}$ are real
amplitudes.


2.2.2  Analysis  Procedure.  Krishnamurthy  begins  his  analysis  by  identifying 
the  opening  and  closing  instants  within  each  pitch  period  with  an  electroglottograph 
(EGG)  signal  sampled  simultaneously  with  the  speech  signal.  Once  these  instants  are 
determined,  the  parameters  of  the  vocal  tract  filter  are  estimated  over  each  phase 
in  a  two  step  process.  First,  the  pole  locations  are  estimated  using  a  backward 
prediction  procedure  introduced  by  Parthasarathy,  Kumaresan,  and  Tufts  [18,24]. 
Second,  the  complex  amplitude  parameters  are  estimated  by  solving  a  set  of  linear 
equations  using  the  complex  poles  estimated  in  the  first  step. 
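The second step can be illustrated with a small least-squares fit. The sketch below assumes the poles are already known and solves the normal equations for the complex amplitudes; the thesis states only that a set of linear equations is solved, so the least-squares formulation, the function names, and the test signal are all assumptions made for illustration.

```python
import cmath


def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small complex system."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0j] * n
    for r in reversed(range(n)):
        acc = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / M[r][r]
    return x


def estimate_amplitudes(signal, poles):
    """Least-squares fit of s(n) = sum_i B_i * p_i**n for known poles p_i."""
    K, L = len(poles), len(signal)
    V = [[p ** n for p in poles] for n in range(L)]  # Vandermonde-style matrix
    # Normal equations: (V^H V) B = V^H s
    VhV = [[sum(V[n][i].conjugate() * V[n][j] for n in range(L))
            for j in range(K)] for i in range(K)]
    Vhs = [sum(V[n][i].conjugate() * signal[n] for n in range(L))
           for i in range(K)]
    return solve_linear(VhV, Vhs)


# Hypothetical test signal: one conjugate pole pair with known amplitudes.
pole = cmath.rect(0.97, 0.25)
poles = [pole, pole.conjugate()]
true_B = [0.5 - 0.2j, 0.5 + 0.2j]
signal = [(true_B[0] * poles[0] ** n + true_B[1] * poles[1] ** n).real
          for n in range(40)]
est_B = estimate_amplitudes(signal, poles)
```

On this noise-free signal the estimated amplitudes match `true_B` to rounding error; with real speech the fit would only be approximate.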

Once  these  parameters  are  estimated,  the  speech  signal  can  be  resynthesized 
using  (8).  Krishnamurthy  presents  results  comparing  short  segments  of  the  original 
voiced  speech  signal  to  a  synthetic  signal  generated  using  this  procedure.  He  has 
shown  that  the  sample-to-sample  difference  between  the  two  waveforms  is  extremely 
low  and  definitely  improved  over  the  one  phase  modeling  of  conventional  speech 
synthesizers.  However,  no  listening  tests  were  performed  to  assess  the  perceived 
quality  by  a  human  subject. 

2.3  Summary 

This  chapter  presented  a  brief  history  of  speech  synthesis  from  early  formant 
and  LPC  synthesizers  to  glottal  pulse-based  systems.  The  literature  is  consistent 
in  identifying  the  need  for  higher  quality,  more  natural  synthetic  speech.  State-of- 
the-art  synthesizers,  while  producing  highly  intelligible  synthetic  speech,  still  cannot 
produce natural-sounding speech. While models for glottal excitation commonly allow
for more than one phase, few vocal tract models reported in the literature allow
for multiple phases within a pitch period. The model proposed in [15] by Krishnamurthy
is one model for speech that does allow for both a multi-phase excitation
and a multi-phase vocal tract. Krishnamurthy showed that the difference between
synthetic and original signals can be minimized for vowel sounds using his two-phase
model. This evidence supports the idea that speech synthesis in general can be improved
with his procedure and that further research into applying it to larger speech
segments is warranted.
III. Analysis-Synthesis Systems

This  chapter  presents  the  methods  used  to  analyze  and  re-synthesize  the  TIMIT 
speech  samples.  Figures  3  and  4  outline  the  processes  used  for  analysis-synthesis 
of  the  damped-exponential  and  LPC  methods  respectively.  These  figures  also  serve 
as  an  outline  for  this  chapter.  First,  the  speech  sample  must  be  decomposed  into 
analysis  frames.  Pitchmarks  (instants  of  glottal  closure)  are  estimated  for  voiced 
speech,  and  segments  of  unvoiced  speech  are  decomposed  into  small  constant  length 
analysis  frames.  Second,  the  phoneme  label  corresponding  to  each  analysis  frame 
is  compared  to  values  in  a  table  to  determine  whether  or  not  the  frame  is  voiced. 
Based  on  the  results  of  the  voicing  analysis  and  the  synthesis  model  (one-phase  LPC, 
two-phase  damped-exponential,  etc.),  the  system  estimates  model  parameters  and 
synthesizes  a  new  waveform.  The  MATLAB®  code  to  implement  these  methods  is 
provided  in  Appendix  B. 

3.1  Pitchmark  Estimation 

The first step in the analysis-synthesis process is to decompose the input speech
waveform into analysis frames roughly equal to one pitch period. To accomplish this,
these analysis-synthesis systems use two programs contained in the “Entropic Signal
Processing System” (ESPS) from Entropic Research Laboratory, Inc. Specifically,
they use the get_f0 and epochs programs to generate a data file containing non-zero
values only at estimated glottal closing instants.

Get_f0 uses an algorithm similar to the method of Secrest and Doddington [28].
This algorithm estimates the fundamental frequency ($f_0$) using the cross-correlation
function and dynamic programming. It operates on a standard sampled-data file
formatted for use with ESPS and produces a file containing estimates of: $f_0$; a voicing
probability; the RMS energy value; and the peak autocorrelation value. Epochs uses
these parameters, the sampled-data file, and dynamic programming to determine
the estimates of vocal fold closure instants. It produces a file with impulses at the
estimated vocal fold closure instants and zeros elsewhere [4].

Figure 3 Analysis-Synthesis procedure for the damped-exponential model.

Figure 4 Analysis-Synthesis procedure for the LPC model.

After  epochs  generates  the  pitchmark  file,  the  analysis-synthesis  system  gen¬ 
erates  a  two-column  matrix  containing  the  starting  point  and  ending  point  of  each 
analysis  frame.  In  areas  where  the  spacing  between  pitchmarks  is  larger  than  the 
expected  maximum  pitch  period  (e.g.  unvoiced  regions),  the  system  divides  the  area 
into  equal  length  analysis  frames.  The  user  selects  the  frequency,  in  hertz,  defining 
the  length  of  these  divisions.  For  this  work,  the  system  used  a  100  Hz  frame  rate  (160 
samples  for  a  16  kHz  sample  rate).  Each  pitchmark  is  moved  to  one  sample  prior 
to  the  nearest  zero  crossing  in  the  speech  signal.  Since  the  pitchmarks  estimated  by 
epochs  typically  occur  at  large  peaks  in  the  speech  file,  estimating  model  parameters 
for  a  frame  from  one  large  peak  to  another  produces  an  unstable  model.  This  move¬ 
ment  stabilizes  the  estimated  model  parameters.  In  addition,  the  movement  of  the 
pitchmarks  reduces  the  error  at  frame  boundaries. 
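The frame-table construction described above can be sketched as follows. This is an illustration, not the thesis code from Appendix B: the function names, the 320-sample maximum pitch period, and the rule used to split a long gap into nearly equal frames are assumptions, as is the reading of "one sample prior to the nearest zero crossing" as the last sample before the sign change.

```python
def move_to_zero_crossing(signal, mark):
    """Return the index just before the zero crossing nearest to mark."""
    best, best_dist = mark, None
    for k in range(len(signal) - 1):
        # A crossing lies between samples k and k + 1 when the sign changes.
        if (signal[k] >= 0) != (signal[k + 1] >= 0):
            dist = abs(k - mark)
            if best_dist is None or dist < best_dist:
                best, best_dist = k, dist
    return best


def build_frames(pitchmarks, num_samples, max_period, unvoiced_len):
    """Return (start, end) sample indices, inclusive, for every analysis frame."""
    bounds = sorted(set([0] + list(pitchmarks) + [num_samples]))
    frames = []
    for lo, hi in zip(bounds, bounds[1:]):
        gap = hi - lo
        if gap <= max_period:
            frames.append((lo, hi - 1))      # one pitch-synchronous frame
            continue
        count = -(-gap // unvoiced_len)       # ceiling division
        for j in range(count):                # nearly equal-length pieces
            start = lo + (j * gap) // count
            end = lo + ((j + 1) * gap) // count - 1
            frames.append((start, end))
    return frames


fs = 16000
unvoiced_len = fs // 100   # 100 Hz frame rate -> 160 samples, as in the text
max_period = 320           # hypothetical maximum pitch period (50 Hz)
frames = build_frames([400, 520, 645], 1000, max_period, unvoiced_len)
shifted = move_to_zero_crossing([3, 2, 1, -1, -2, -1, 1, 2], 1)
```

With these inputs the two gaps longer than `max_period` are each split into three frames, and the two pitch-period-sized gaps become one frame each.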

3.2  Voicing  Determination 

Table  1  contains  a  list  of  each  phoneme  in  the  TIMIT  corpus,  its  class  (vowel, 
nasal,  fricative,  stop,  etc.),  and  a  voiced  or  unvoiced  classification.  This  table  is  used 
to  decide  if  an  analysis  frame  is  voiced  or  unvoiced. 

The  system  must  first  determine  to  what  phoneme  the  current  analysis  frame 
corresponds.  The  system  compares  the  starting  index  of  the  frame  within  the  original 
sentence  to  the  starting  and  ending  indices  in  the  TIMIT  phonemically  labeled  file 
for  the  sentence  under  analysis.  It  then  identifies  the  phoneme  label  corresponding  to 
the  indices  which  contain  the  frame  start  index  as  the  phoneme  under  investigation. 

The  system  then  searches  the  phoneme  table  until  the  phoneme  label  under 
analysis  is  found.  The  corresponding  voiced/unvoiced  classification  is  used  as  the 
voicing  determination. 
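The lookup just described amounts to two table searches. The sketch below assumes phoneme labels are available as (start, end, label) triples with the end index exclusive, and includes only a handful of entries from Table 1; the label list, the function names, and the default for unknown labels are hypothetical.

```python
# A few voicing entries copied from Table 1 (True means VOICED).
VOICING = {
    "ih": True, "ae": True, "m": True, "dh": True,
    "s": False, "hh": False, "dx": False, "h#": False, "pau": False,
}


def phoneme_at(labels, frame_start):
    """Return the label of the segment containing frame_start, or None."""
    for start, end, label in labels:
        if start <= frame_start < end:
            return label
    return None


def is_voiced(labels, frame_start):
    """Voicing decision for the frame that starts at frame_start."""
    label = phoneme_at(labels, frame_start)
    # Assumption: labels missing from the table are treated as unvoiced.
    return VOICING.get(label, False)


# Hypothetical label file contents for a short utterance.
labels = [(0, 2000, "h#"), (2000, 3500, "s"), (3500, 6000, "ih"), (6000, 7200, "m")]
decisions = [is_voiced(labels, start) for start in (0, 2000, 4000, 7199)]
```

Only the frame start index is tested, as in the text, so a frame that straddles a phoneme boundary takes the voicing of the phoneme in which it begins.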
Table 1 Phoneme Voicing Determination

Phoneme  Class      Voicing     Phoneme  Class      Voicing
-------  ---------  --------    -------  ---------  --------
iy       Vowel      VOICED      ih       Vowel      VOICED
eh       Vowel      VOICED      ae       Vowel      VOICED
ix       Vowel      VOICED      ax       Vowel      VOICED
ah       Vowel      VOICED      ax-h     Vowel      VOICED
uw       Vowel      VOICED      ux       Vowel      VOICED
uh       Vowel      VOICED      ao       Vowel      VOICED
aa       Vowel      VOICED      ey       Vowel      VOICED
ay       Vowel      VOICED      oy       Vowel      VOICED
aw       Vowel      VOICED      ow       Vowel      VOICED
er       Vowel      VOICED      axr      Vowel      VOICED
m        Nasal      VOICED      em       Nasal      VOICED
n        Nasal      VOICED      nx       Nasal      VOICED
en       Nasal      VOICED      ng       Nasal      VOICED
eng      Nasal      VOICED      l        Liquid     VOICED
el       Liquid     VOICED      r        Liquid     VOICED
y        Liquid     VOICED      w        Liquid     VOICED
hh       Liquid     UNVOICED    hv       Liquid     VOICED
ch       Fricative  UNVOICED    jh       Fricative  UNVOICED
dh       Fricative  VOICED      z        Fricative  UNVOICED
zh       Fricative  UNVOICED    v        Fricative  VOICED
f        Fricative  UNVOICED    th       Fricative  UNVOICED
s        Fricative  UNVOICED    sh       Fricative  UNVOICED
b        Stop       VOICED      d        Stop       VOICED
dx       Stop       UNVOICED    g        Stop       VOICED
p        Stop       UNVOICED    t        Stop       UNVOICED
k        Stop       UNVOICED    pcl      Silence    UNVOICED
tcl      Silence    UNVOICED    kcl      Silence    UNVOICED
bcl      Silence    UNVOICED    dcl      Silence    UNVOICED
gcl      Silence    UNVOICED    epi      Silence    UNVOICED
h#       Silence    UNVOICED    pau      Silence    UNVOICED
q        Undefined  UNVOICED    und      Undefined  UNVOICED
It must be noted that the analysis-synthesis systems developed here do not
rely  on  the  knowledge  of  the  phoneme  containing  each  frame;  only  the  voicing  of 
each  frame  is  important.  The  systems  use  the  phoneme  identification  since  the 
phonemically  labeled  files  were  conveniently  provided  with  the  TIMIT  data  base. 


Figure  5  Linear  prediction  model  for  a  single  frame  of  unvoiced  speech 


3.3  Unvoiced  Frame  Synthesis 

Noise-driven LPC is used to re-synthesize unvoiced frames for all synthesis
systems. Figure 5 illustrates the model used to synthesize an unvoiced frame. A
unit-energy random noise generator provides the input to an all-pole, linear filter.
This input is weighted by $G$ to provide the excitation waveform $Gu(n)$. The filter
output at index $n$ can be represented as a weighted sum of the past $p$ outputs plus
the excitation:

$$
s(n) = \sum_{i=1}^{p} a_i s(n-i) + G u(n) \qquad (9)
$$

where $a_i$ is the $i$th linear predictor coefficient and $p$ is the system model order
specified by the user. The synthesis system uses this equation to generate the new
unvoiced frame after all parameters have been estimated.
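The recursion in (9) can be sketched directly. In this illustration the unit-energy noise source is taken to be zero-mean, unit-variance Gaussian noise, which is an assumption; the function name, the seeded generator, and the coefficient values are likewise hypothetical.

```python
import random


def synthesize_unvoiced(a, gain, length, history=None, rng=None):
    """Generate one frame with s(n) = sum_i a[i-1] * s(n - i) + gain * u(n)."""
    rng = rng or random.Random(0)   # seeded so the sketch is repeatable
    p = len(a)
    # past holds earlier outputs, most recent last; zeros if none are given.
    past = list(history) if history is not None else [0.0] * p
    frame = []
    for _ in range(length):
        u = rng.gauss(0.0, 1.0)     # assumed unit-variance noise sample
        s = sum(a[i] * past[-1 - i] for i in range(p)) + gain * u
        frame.append(s)
        past.append(s)
    return frame


# With zero gain the filter simply rings down from its initial state.
ringdown = synthesize_unvoiced([0.5], gain=0.0, length=3, history=[1.0])

# A 160-sample noise-driven frame with hypothetical coefficients.
noisy = synthesize_unvoiced([1.2, -0.5], gain=0.1, length=160)
```

Passing the last `p` samples of the previous frame as `history` is one way to keep the filter state continuous across frame boundaries; the thesis does not say how this is handled.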

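3.3.1 LPC filter coefficients. It can be shown [26] that the LPC analysis
equations for the covariance method are

$$
\begin{bmatrix}
\phi(1,1) & \phi(1,2) & \cdots & \phi(1,p) \\
\phi(2,1) & \phi(2,2) & \cdots & \phi(2,p) \\
\phi(3,1) & \phi(3,2) & \cdots & \phi(3,p) \\
\vdots    & \vdots    &        & \vdots    \\
\phi(p,1) & \phi(p,2) & \cdots & \phi(p,p)
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} \phi(1,0) \\ \phi(2,0) \\ \phi(3,0) \\ \vdots \\ \phi(p,0) \end{bmatrix}
\qquad (10)
$$

where

$$
\phi(i,k) = \sum_{m=0}^{N-1} s(m-i)\, s(m-k), \qquad i = 1, \ldots, p, \quad k = 0, \ldots, p \qquad (11)
$$

Equation (10) can be rewritten as

$$
\Phi \mathbf{a} = \boldsymbol{\phi} \qquad (12)
$$

The analysis system estimates $\mathbf{a}$ in (12) by

$$
\mathbf{a} = \Phi^{\dagger} \boldsymbol{\phi} \qquad (13)
$$

where $\dagger$ represents the pseudoinverse operation. The pseudoinverse is used instead
of the inverse to handle any ill-conditioning of $\Phi$.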

3.3.2 Determination of gain (G). The gain ($G$) used to weight the random
noise is defined as the square root of the minimum mean-squared error between the
estimated filter output and the original waveform ($E$):

$$
G = \sqrt{E} \qquad (14)
$$

where $E$ is expressed as [26]

$$
E = \sum_{m=0}^{N-1} s^2(m) - \sum_{k=1}^{p} a_k \sum_{m=0}^{N-1} s(m)\, s(m-k) \qquad (15)
$$

$$
\phantom{E} = \phi(0,0) - \sum_{k=1}^{p} a_k \phi(0,k) \qquad (16)
$$
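The gain computation can be sketched from the covariance terms alone. The code below is an illustration with hypothetical names and test data; it clamps tiny negative values of the error that can arise from rounding, which is an added safeguard not mentioned in the text.

```python
import math


def phi(s, start, N, i, k):
    """Covariance term: sum over the frame of s(m - i) * s(m - k)."""
    return sum(s[start + m - i] * s[start + m - k] for m in range(N))


def lpc_gain(s, start, N, a):
    """Gain G = sqrt(E), with E = phi(0,0) - sum_k a_k * phi(0,k)."""
    p = len(a)
    E = phi(s, start, N, 0, 0) - sum(
        a[k - 1] * phi(s, start, N, 0, k) for k in range(1, p + 1))
    return math.sqrt(max(E, 0.0))   # guard against rounding below zero


# With all-zero predictor coefficients the error is just the frame energy.
energy_gain = lpc_gain([0.0, 0.0, 3.0, 4.0], start=2, N=2, a=[0.0, 0.0])

# A signal that follows its predictor exactly leaves essentially no residual.
s = [1.0, 0.7]
for n in range(2, 30):
    s.append(1.2 * s[n - 1] - 0.5 * s[n - 2])
exact_gain = lpc_gain(s, start=2, N=28, a=[1.2, -0.5])
```

The first case gives a gain of 5 (the square root of 9 + 16), and the second gives a gain near zero because the prediction error vanishes.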
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3-4  Damped-Exponential  Voiced  Frame  Synthesis 

The  damped-exponential  model  represents  the  voiced  waveform  as 

(17) 

where  Ai  and  pi  are  the  i^h  complex  amplitude  and  complex  pole  respectively.  Each 
pole  (if  not  real)  occurs  with  its  complex  conjugate  as  well.  This  conjugate  pairing 
results  in  the  sinusoidal  component  resonating  at  the  pole  frequency.  Appendix  A 
gives  a  detailed  description  of  these  parameters  and  the  method  used  to  estimate 
them. 
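
To make the role of the complex amplitudes and poles concrete, the sketch below evaluates a sum of damped exponentials, assuming the usual form $s(n) = \sum_i A_i\, p_i^{\,n}$ for such models. That form, the function name, and the `n_start` argument are assumptions of this example; Appendix A remains the reference for the model and its estimation.

```python
import numpy as np

def synthesize_damped_exponential(amps, poles, n_samples, n_start=0):
    """Return Re{ sum_i A_i * p_i**n } for n = n_start .. n_start + n_samples - 1."""
    amps = np.asarray(amps, dtype=complex)
    poles = np.asarray(poles, dtype=complex)
    n = np.arange(n_start, n_start + n_samples)
    # Rows: one damped exponential per pole; columns: sample index n.
    frame = (amps[:, None] * poles[:, None] ** n[None, :]).sum(axis=0)
    # With conjugate-paired poles and amplitudes the imaginary part cancels.
    return frame.real
```

For example, amplitudes 0.5 and 0.5 on the conjugate pole pair $0.9\,e^{\pm j\pi/4}$ give the damped cosine $0.9^{n}\cos(\pi n/4)$. A negative `n_start` evaluates the same model at negative sample indices.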

The  single-phase  damped-exponential  system  uses  the  same  parameters  for  the 
entire pitch period. The two-phase damped-exponential system allows the parameters
to change within the pitch period; thus a transition point must be estimated.
After this point is determined, the pitch period is divided into two subframes
corresponding to the first and second phases. Then, each subframe is synthesized using
the  same  method  as  for  a  single  model  per  pitch  period.  The  second  subframe  is 
appended  to  the  first  to  create  the  total  synthetic  pitch  period. 

3.4.1 Transition Point Determination. Most speech synthesis systems in
use today allow only one set of parameters to model the vocal tract resonances
in each pitch period. The systems developed in this work, however, allow the
parameters to change once within the pitch period in hopes of better modeling the
effect of subglottal to supraglottal coupling when the vocal folds open. Typically,
those  systems  that  do  allow  for  the  parameters  to  change  require  identification  of 
the  opening  instant  using  an  electroglottograph  (EGG)  sampled  simultaneously  with 
the  digitized  speech.  The  analysis-synthesis  systems  presented  in  this  thesis  estimate 
the  point  at  which  the  model  parameters  should  change  directly  from  the  digitized 
waveform. 
Model parameters are estimated over a small segment at the beginning of
the pitch period (when the vocal cords are most likely closed) and again over a
small segment at the end (when the vocal cords are most likely still open). The
systems  synthesize  a  complete  pitch  period  using  each  set  of  model  parameters  and 
evaluate  the  error  between  each  synthesized  frame  and  the  original  frame  to  find  the 
transition  point.  The  error  should  be  small  near  the  area  over  which  the  parameters 
were  estimated  and  larger  elsewhere.  The  plots  of  the  error  between  each  synthesized 
frame  and  the  original  frame  should  cross  at  some  point.  This  point  is  identified  as 
the  parameter  transition  point  within  the  original  pitch  period. 

3.4.1.1 Subframe parameter estimation and synthesis. The first subframe
(small segment at the beginning of the pitch period) is defined as the initial
I  of  the  pitch  period.  This  fraction  was  chosen  in  an  attempt  to  avoid  including 
the  actual  transition  point  in  one  of  the  analysis  windows.  The  second  subframe 
(small  segment  at  the  end  of  the  pitch  period)  is  defined  as  the  last  |  of  the  pitch 
period.  If  the  length  of  either  subframe  is  less  than  or  equal  to  the  model  order 
for  its  corresponding  phase,  the  length  is  extended  to  be  either  1.5  times  the  model 
order  or  80%  of  the  pitch  period,  whichever  is  smaller.  This  amount  of  extension  was 
chosen  to  allow  for  an  adequate  amount  of  data  from  which  to  estimate  the  model 
parameters.  Figure  6  illustrates  a  sample  pitch  period  and  its  two  subframes. 
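
The subframe length rule described above can be sketched as follows. The function name is hypothetical, the subframe fraction is passed as a parameter rather than fixed, and rounding to the nearest sample is an assumption of the example.

```python
def subframe_length(period_len, model_order, fraction):
    """Length in samples of an analysis subframe within one pitch period."""
    length = int(round(fraction * period_len))
    if length <= model_order:
        # Too few samples: extend to 1.5 times the model order or to
        # 80% of the pitch period, whichever is smaller.
        length = min(int(round(1.5 * model_order)), int(round(0.8 * period_len)))
    return length
```

For short pitch periods the extension keeps the analysis window longer than the model order where possible while never exceeding 80% of the period.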

The system estimates model parameters (complex poles and amplitudes) over
the  first  subframe  in  the  same  manner  as  described  in  Appendix  A.  However,  instead 
of  generating  a  synthetic  frame  the  same  length  as  the  analysis  window,  it  synthesizes 
a  complete  pitch  period.  Figure  7  illustrates  the  number  of  samples  analyzed  and 
the  number  of  samples  synthesized  for  each  subframe  analysis-synthesis. 

Next,  the  system  must  estimate  a  new  set  of  model  parameters  over  the  second 
subframe.  Once  these  are  complete,  it  also  synthesizes  a  new  complete  pitch  period. 
As in the normal synthesis method, the initial sample index of the analysis window
is considered to be n = 0. However, since the analysis window falls at the end of
the pitch period, the system must synthesize the preceding portion of the pitch
period at negative sample indices, as illustrated by Figure 7(b).

Figure 6 Plot of a sample pitch period and its two subframes.

Figure 7 Illustrations of sample indexing for synthesis using model parameters
estimated over the first subframe (a) and second subframe (b). The total
number of samples synthesized is 145.


3.4.1.2 Error analysis and transition determination. Now that the
two synthetic pitch periods have been generated, the system needs to analyze the
error between each and the original waveform. The error is simply the normalized
squared point-by-point difference between the original waveform and the synthetic
frame:

$$e(n) = \left(s(n) - \hat{s}(n)\right)^{2} \tag{18}$$


Peaks of the error waveform e(n) are found using the peak selection algorithm
given in McMillan [22]. The error amplitudes between peaks are linearly interpolated
to generate a smoothed error waveform. Figure 8(a) shows the results of this
procedure for the pitch period generated using model parameters from the first
subframe of the sample data in Figure 7. Figure 8(b) shows the results for the second
subframe synthesis.


Once  the  system  generates  the  two  smoothed  error  waveforms,  it  identifies  the 
first  crossing  occurring  between  20%  and  60%  of  the  pitch  period  length.  In  [24], 
Parthasarathy  and  Tufts  give  a  typical  range  for  opening  instants  as  20-50%  of  the 
pitch  period.  Empirical  testing  showed  the  extension  of  this  range  to  60%  resulted 
in  better  quality  as  many  transition  points  were  located  between  50%  and  60%. 
Figure 8(c) shows the two smoothed error waveforms overlaid, with the estimated
transition point labeled.
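
The error smoothing and crossing search described in this section can be sketched as follows. Several details are assumptions made only for illustration: a plain local-maximum test stands in for the peak selection algorithm of McMillan [22], the squared error is normalized by its maximum, a crossing is taken to be any sign change of the difference between the two smoothed errors, and both function names are hypothetical.

```python
import numpy as np

def smoothed_error(original, synthetic):
    """Squared error, normalized (assumed: by its maximum), interpolated between peaks."""
    e = (np.asarray(original, dtype=float) - np.asarray(synthetic, dtype=float)) ** 2
    if e.max() > 0:
        e = e / e.max()
    # Stand-in peak picker: interior local maxima plus both end points.
    peaks = [n for n in range(1, len(e) - 1) if e[n] >= e[n - 1] and e[n] > e[n + 1]]
    idx = [0] + peaks + [len(e) - 1]
    return np.interp(np.arange(len(e)), idx, e[idx])

def transition_point(err_first, err_second, lo=0.2, hi=0.6):
    """First crossing of the two smoothed errors between lo and hi of the period."""
    diff = np.asarray(err_first, dtype=float) - np.asarray(err_second, dtype=float)
    n_len = len(diff)
    first = max(1, int(np.ceil(lo * n_len)))
    last = int(np.floor(hi * n_len))
    for n in range(first, last + 1):
        if np.sign(diff[n - 1]) != np.sign(diff[n]):
            return n
    return None  # no crossing found inside the search range
```

Returning `None` leaves the handling of pitch periods without a crossing to the caller; the text does not say how such periods were treated.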


3.5  LPC  Voiced  Frame  Synthesis 

Figure  9  illustrates  the  linear  systems  used  to  model  voiced  pitch  periods  with 
LPC.  Figure  9(a)  shows  that  a  weighted  impulse  at  n  =  1  is  used  as  the  input  to  the 
linear vocal tract filter when a single phase per pitch period is desired. Figure 9(b)
shows that the same input is used for a two-phase approach, but the filter parameters
are allowed to change. Note that the input to the filter is a sequence of zeros for the
second phase. Also, for voiced LPC speech synthesis, the last p (p = model order)
synthesized samples of the previous analysis frame are used as initial conditions for
the vocal tract filter in the current analysis frame.

Figure 8 Plots of the error waveform (original and smoothed) for the synthetic
waveform generated with parameters estimated over (a) the first subframe
and (b) the second subframe; (c) is a plot of the two smoothed
waveforms overlaid with the estimated transition point labeled.


Figure 9 Linear prediction model for a single pitch period of voiced speech using
(a) one phase per pitch period and (b) two phases.

The  LPC  analysis-synthesis  systems  estimate  the  parameters  for  the  vocal 
tract  filter  and  gain  in  the  same  manner  as  described  in  section  3.3.  For  a  two-phase 
approach,  the  transition  point  is  estimated  as  described  in  section  3.4.1  and  the  gain 
and  LPC  coefficients  are  re-estimated  in  the  second  frame.  Thus,  the  two-phase 
LPC  system  and  the  two-phase  damped-exponential  system  use  the  same  transition 
points  for  each  sentence  and  speaker;  only  the  voiced  speech  synthesis  model  differs 
between  the  systems. 
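
The voiced LPC synthesis of Figure 9 can be sketched as a single recursion whose coefficients switch at the transition point. This is a sketch under stated assumptions: the function name is hypothetical, the weighted impulse is placed on the first sample of the period, both coefficient sets have the same order p, and `history` must hold at least p previously synthesized samples.

```python
import numpy as np

def synthesize_voiced_lpc(a1, gain, period_len, history, a2=None, transition=None):
    """One pitch period: weighted impulse into an all-pole filter.

    If a2 and transition are given, the coefficients switch to a2 from sample
    index `transition` onward; the input during the second phase is zero.
    """
    a1 = np.asarray(a1, dtype=float)
    p = len(a1)
    two_phase = a2 is not None and transition is not None
    if two_phase:
        a2 = np.asarray(a2, dtype=float)
    # Last p samples of the previous frame serve as initial conditions.
    buf = np.concatenate([np.asarray(history, dtype=float)[-p:], np.zeros(period_len)])
    excitation = np.zeros(period_len)
    excitation[0] = gain                      # weighted impulse
    for n in range(period_len):
        a = a2 if (two_phase and n >= transition) else a1
        buf[p + n] = np.dot(a, buf[n:n + p][::-1]) + excitation[n]
    return buf[p:]
```

Because the excitation is zero after the impulse, the second phase is simply the free response of the new filter started from the samples the first phase produced.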
3.6 Summary

This chapter described the analysis-synthesis systems used in this thesis. An
overview of the entire analysis-synthesis process was illustrated in Figures 3 and 4.
While many analysis-synthesis systems in use today make use of LPC and some use
the damped-exponential model, most use only one set of vocal tract parameters
per pitch period. Some systems do use two phases per pitch period, but require
synchronous EGG data to estimate the vocal tract parameter transition point. This
chapter  described  a  method  to  estimate  this  point  directly  from  the  digitized  speech 
sample.  There  are  no  known  analysis-synthesis  systems  in  use  today  that  employ 
this  technique.  The  next  chapter  discusses  the  experiments  that  were  conducted  to 
test  the  relative  merits  of  the  two-phase  damped-exponential  model  as  compared  to 
the  one-phase  LPC,  one-phase  damped-exponential,  and  two-phase  LPC  models. 
IV. Experiments and Results

This  chapter  describes  the  design  and  implementation  of  the  subjective  listening  test 
used  to  evaluate  the  algorithms  discussed  in  Chapter  III.  In  particular,  this  chapter 
details  the  test  sentences  used,  the  model  orders  chosen,  and  the  statistical  analyses 
that  were  performed  on  the  data  collected. 

4.1 Subjective Listening Test

A  panel  of  19  volunteers  from  the  Air  Force  Institute  of  Technology  and  the 
USAF’s  Armstrong  Laboratory  (AL)  was  formed  under  AL’s  Human  Use  Review 
Committee  (HURC)  protocol  #83-58:  “Human  Exposure  to  Acoustic  Energy”,  to 
participate  in  a  subjective  listening  test.  All  subjects  were  given  standard  pure-tone 
audiogram  hearing  tests  to  determine  their  suitability  for  the  study;  all  subjects  had 
normal  hearing.  The  subjects  were  not  trained  prior  to  the  test. 

This  test  was  designed  to  answer  the  two  questions  fundamental  to  this  work: 

1. Does the use of a two-phase damped-exponential model in voiced-speech synthesis
show improvement in quality over that obtained with one- and two-phase
LPC or one-phase damped-exponential models?

2. Does the use of a two-phase model in general show an improvement in quality
over one-phase models?

To  answer  these  questions,  statistical  analysis  of  the  test  results  is  needed. 
One  analysis  method  well-suited  for  this  task  is  the  technique  known  as  analysis  of 
variance  (ANOVA).  With  ANOVA  in  mind,  a  two  factor,  within-subjects  listening 
test  was  designed.  The  two  factors  of  interest  are  speaker  variations  and  variations 
between  analysis-synthesis  methods  (treatments). 

Four speakers (fdrd1, fcmh0, mcmj0, mchh0) from the TIMIT database form
the 4 levels within the speaker factor. These speakers were chosen because there are
three common sentences uttered by each. Two females and two males were chosen
to provide a good gender balance. The three sentences used are:

1. sa1 “She had your dark suit in greasy wash water all year.”

2. sa2 “Don’t ask me to carry an oily rag like that.”

3. sx284 “Jeff thought you argued in favor of a centrifuge purchase.”

Four  analysis-synthesis  methods  were  applied  to  each  TIMIT  sentence-speaker 
combination.  The  four  methods  used  include  single-phase  damped-exponential, 
single-phase  LPC,  two-phase  damped-exponential,  and  two-phase  LPC.  Each  method 
is  performed  pitch-synchronously  as  was  described  in  Chapter  III.  The  model  orders 
used  by  the  analysis-synthesis  systems  are  18  for  each  voiced  phase  and  unvoiced 
frame.  The  value  of  18  was  chosen  as  a  typical  value  for  data  sampled  at  a  16  kHz 
rate.  Table  2  summarizes  the  speaker-treatment-sentence  combinations  for  this  test. 

Table 2 Two factor within-subjects experiment block design
(DE = Damped-Exponential).

                            Treatment
Speaker   1 phase DE   2 phase DE   1 phase LPC   2 phase LPC
fdrd1     sa1          sa1          sa1           sa1
          sa2          sa2          sa2           sa2
          sx284        sx284        sx284         sx284
fcmh0     sa1          sa1          sa1           sa1
          sa2          sa2          sa2           sa2
          sx284        sx284        sx284         sx284
mcmj0     sa1          sa1          sa1           sa1
          sa2          sa2          sa2           sa2
          sx284        sx284        sx284         sx284
mchh0     sa1          sa1          sa1           sa1
          sa2          sa2          sa2           sa2
          sx284        sx284        sx284         sx284

The listening test was conducted at Armstrong Laboratory at Wright-Patterson
AFB, OH. The procedure followed for each subject (volunteer from the panel) was:

1. The subject entered the laboratory, was seated at a Sun SPARCstation 2, and
given verbal instructions on the testing process. Instructions covered data
entry using the keyboard and when to enter quality ratings.

2.  The  subject  was  presented  with  every  speaker-treatment  combination  one  block 
at  a  time.  A  block  consisted  of  a  set  of  three  sentences  from  one  speaker  treated 
with  one  synthesis  method.  There  were  16  blocks  presented  to  the  subject 
during  the  session.  The  subject  was  asked  to  rate  the  quality  of  each  sentence 
on  a  scale  from  1  (poor)  to  5  (excellent).  The  order  in  which  the  blocks  were 
presented  was  randomly  determined  using  a  uniform  random  number  generator 
and was different for each subject. Within each block, the sentence order was
also  randomly  determined. 

3. At the beginning of each block presented to the subject, four “anchor” sentences
were played. This “anchoring” was performed to provide lower and
upper bounds of quality ratings to the subject prior to being asked to rate the
quality of speech. These anchor sentences provided two examples (one male
and one female) of speech that might be considered to have a quality of 1, and
two examples (one male and one female) of speech that might be considered
to have a quality of 5. The order of presentation was randomized between
poor and excellent examples, and then again between male/female speakers
within each quality boundary. The speakers and sentences chosen were different
from those included in the rating portion: faks0 speaking sentence si1573
(“His captain was thin and haggard and his beautiful boots were worn and
shabby.”), and mjls0 speaking si1726 (“And men also used vacuum cleaners in
both rooms, sucking dust up once more.”). Single-phase, 10th-order LPC was
used to synthesize the poor quality examples while a single-phase, 50th-order
damped-exponential model was used for the excellent quality examples. These
orders and models were identified empirically to have the quality levels desired
for this listening test.
4. After the anchors were played, each of the three sentences to be rated was
played  in  random  order.  The  subject  was  asked  to  rate  the  overall  quality  of 
each  of  the  three  sentences. 

5.  After  all  16  blocks  were  presented,  the  test  was  complete  for  that  subject. 

4.2 Listening Test Results

Figure 10 shows a plot of the quality scores averaged across all 19 subjects for
each speaker and treatment. At a quick look, one might conclude (though not
entirely accurately) that the two-phase damped-exponential model performed
best in general. However, because the mean quality scores for this treatment
and the one-phase damped-exponential model are nearly equal for speaker
mcmj0, this conclusion cannot be drawn from the plot alone. To draw
well-substantiated claims from the data, a statistical analysis must be performed.
A  two  factor  within-subjects  analysis  of  variance  (ANOVA)  is  used  to  provide  the 
analysis.  However,  this  type  of  ANOVA  is  not  the  most  appropriate  method  to  use 
for  ordinal  data.  Most  parametric  ANOVA  methods  assume  a  linear  relationship 
of  the  data  (scores)  collected.  The  opinion  scores  collected  in  this  listening  test  do 
not  have  this  linear  relationship;  that  is,  an  opinion  score  of  (5)  is  not  necessarily 
two  times  better  than  a  (2),  and  a  (5)  is  not  necessarily  five  times  better  than  a 
(1). Although it may not be the most appropriate test, it still provides insight into
the  relative  performance  of  the  four  synthesis  methods  tested.  An  additional  non- 
parametric  test  called  the  Friedman  two-way  ANOVA  by  ranks  [29]  is  also  conducted 
to  add  credence  to  the  conclusions  drawn  from  the  parametric  analysis. 

4.2.1  Assumptions.  There  are  four  fundamental  assumptions  made  when 
performing  within-subjects  ANOVA. 

1. Homogeneity of within-factor variances (i.e. variances are equal).

2. Normally distributed factor populations.

3. Independence: The individual factor and between factor sum of squares “provide
independent information about the outcome of the experiment” [13].

4. The sphericity assumption: The variances of differences between paired treatment
scores are equal throughout the population.

Figure 10 Quality scores averaged across subjects for each speaker and treatment.

The fourth assumption, sphericity, is frequently violated in human subjective tests [13].
The effect of this violation is to positively bias the intermediate result of the ANOVA.
To correct for this positive bias, a very stringent correction is made to the critical
F-value chosen. This correction, called the Geisser-Greenhouse correction, forces the
critical F-value to be f(1, n − 1) for all F tests, where n is the number of subjects
in the within-subjects ANOVA. This correction factor was used in all ANOVA tests
performed in this work.

4.2.2 Other considerations. Certain phenomena occur when using human
subjects  to  perform  subjective  listening  tests.  Two  particular  phenomena  are  the 
practice  effect  and  effects  of  differential  carryover. 

The practice effect concerns the fact that a subject may show a general improvement
in testing, or become bored or fatigued [13]. The listening test designed
for this work attempted to minimize this effect by randomizing the presentation
order amongst the 19 subjects. Thus each block is presented at a different location
within the test so that effects of training from one subject will be offset by the early
presentation of that block to another subject.

The  differential  carryover  effect  concerns  the  effect  an  earlier  presentation  of 
a  treatment  has  on  the  next  presentation  [13].  For  example,  presentation  of  a  poor 
quality  sample  just  prior  to  the  presentation  of  a  moderate  quality  sample  may  cause 
the  rating  of  the  moderate  sample  to  be  inflated.  Presentation  of  a  high  quality 
sample prior to a moderate sample may deflate the rating of the moderate sample.
The  re-anchoring  at  the  beginning  of  each  block  of  this  test  was  designed  to  minimize 
this  effect. 

4.2.3 Two Factor Within-Subjects ANOVA. With these assumptions and
considerations in mind, the two factor within-subjects analysis of variance is performed
as described in [13]. For this analysis, the mean score for each subject in each
block of the listening test is used as an observation. As mentioned above, each factor
has four levels of interest. Table 3 summarizes the ANOVA performed on the data
collected in the subjective listening tests. Using the Geisser-Greenhouse correction
and significance level of α = 0.05, the critical F score is f(1, n − 1) = f(1, 18) = 4.41.


Table 3 Summary of Analysis of Variance

Source           SS      df    MS      F
Speaker (Sp)     50.83     3   16.94   20.12
Treatment (T)    21.45     3    7.15    9.36
Sp x T           12.02     9    1.34    2.41
Subjects (S)     63.88    18    3.55
Sp x S           45.48    54    0.84
T x S            41.25    54    0.76
Sp x T x S       89.92   162    0.56
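
The decomposition summarized in Table 3 can be sketched as below for a complete array of block-mean scores indexed as subject, speaker, treatment. This is a generic textbook-style sketch and not the code used for the thesis; the function name and array layout are assumptions. Each main effect is tested against its own interaction with subjects, and the Sp x T interaction against the three-way term, which is consistent with the F ratios reported in Table 3.

```python
import numpy as np

def within_subjects_anova(x):
    """Two-factor within-subjects ANOVA for x[subject, level_A, level_B]."""
    x = np.asarray(x, dtype=float)
    n, ka, kb = x.shape
    gm = x.mean()
    ss = {
        "A": n * kb * np.sum((x.mean(axis=(0, 2)) - gm) ** 2),
        "B": n * ka * np.sum((x.mean(axis=(0, 1)) - gm) ** 2),
        "S": ka * kb * np.sum((x.mean(axis=(1, 2)) - gm) ** 2),
    }
    # Two-way cell means give the interaction terms by subtraction.
    ss["AxB"] = n * np.sum((x.mean(axis=0) - gm) ** 2) - ss["A"] - ss["B"]
    ss["AxS"] = kb * np.sum((x.mean(axis=2) - gm) ** 2) - ss["A"] - ss["S"]
    ss["BxS"] = ka * np.sum((x.mean(axis=1) - gm) ** 2) - ss["B"] - ss["S"]
    ss["AxBxS"] = np.sum((x - gm) ** 2) - sum(ss.values())
    df = {
        "A": ka - 1, "B": kb - 1, "S": n - 1,
        "AxB": (ka - 1) * (kb - 1),
        "AxS": (ka - 1) * (n - 1),
        "BxS": (kb - 1) * (n - 1),
        "AxBxS": (ka - 1) * (kb - 1) * (n - 1),
    }
    ms = {key: ss[key] / df[key] for key in ss}
    f = {
        "A": ms["A"] / ms["AxS"],
        "B": ms["B"] / ms["BxS"],
        "AxB": ms["AxB"] / ms["AxBxS"],
    }
    return ss, df, ms, f
```

With 19 subjects and four levels per factor, the degrees of freedom returned are 3, 3, 9, 18, 54, 54 and 162, matching the df column of Table 3.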

The  first  set  of  hypotheses  to  test  is  for  interactions  between  the  two  factors 
(Speaker  and  Treatment).  Then,  if  the  null  hypothesis  for  the  between  factors  test 
is not rejected, the within factor interactions can be analyzed. Let the hypotheses
be:

$H'_0$: $\mu_T = \mu_{Sp}$ (i.e. between factors)

$H'_a$: $\mu_T \neq \mu_{Sp}$

$H''_0$: $\mu_{fdrd1} = \mu_{fcmh0} = \mu_{mcmj0} = \mu_{mchh0}$ (i.e. between speakers)

$H''_a$: At least two of the $\mu_{speaker}$'s are different

$H'''_0$: $\mu_{DE1} = \mu_{DE2} = \mu_{LPC1} = \mu_{LPC2}$ (i.e. between treatments)

$H'''_a$: At least two of the $\mu_{treatment}$'s are different

where the 0 and a subscripts denote the null and alternate hypotheses, respectively.

In Table 3, the row for Sp x T represents the summary of the between factor
analysis. Since f(Sp x T) = 2.41 < 4.41, the critical value, the null hypothesis $H'_0$ is
not rejected, indicating no statistically significant (at an α = 0.05 significance level)
differences between the factor mean quality scores. Since this null hypothesis is not
rejected, we proceed to the within factor variations.

Since the f scores for each of the within factor tests exceed the critical value
of 4.41, the corresponding null hypotheses ($H''_0$ and $H'''_0$) are rejected. This rejection
indicates there are statistically significant (again at the α = 0.05 level) differences
in the mean quality scores among speakers, and also among synthesis treatments.

4.2.4 Tukey test. Since the ANOVA showed statistically significant differences
in the mean quality scores within factors, a Tukey test [13] is performed
to analyze these differences. This test simply compares all pairwise differences in
means within a given factor.

4.2.4.1 Within-Speaker Analysis. Table 4 summarizes the mean quality
score differences for the speaker factor after performing the first step of the Tukey
test. The entries below the main diagonal of the table are omitted as they are mirror
images of the entries above it.


Table 4 Pairwise differences of means for Speaker factor

                            Speaker
                  mcmj0    mchh0    fdrd1    fcmh0
Speaker  Mean     2.2237   2.5877   2.9912   3.3070
mcmj0    2.2237   -        0.3640   0.7675   1.0833
mchh0    2.5877            -        0.4035   0.7193
fdrd1    2.9912                     -        0.3158
fcmh0    3.3070

The second step is to calculate the minimum pairwise mean difference that
must be exceeded to show statistically significant differences. This threshold value
is defined as [13]

$$d_T = q_T \sqrt{\frac{MS_{S/A}}{n}} \tag{19}$$

where $q_T$ is an entry in the table of the studentized range statistic, $MS_{S/A}$ is the
error term from the overall ANOVA, and n is the number of samples for each level
under analysis in the Tukey test.

For this work, $q_T$ = 3.76 ($df_{error}$ = 54, k = 4) as shown in Table A-5 in [13].
From Table 3, the error term is the MS for Sp x S = 0.84. Finally, the number of
samples for each speaker, n, is 76 (19 subjects and 4 treatments per speaker). The
resulting $d_T$ is 0.3958.

Only  the  differences  in  means  between  each  female  speaker  and  each  male 
speaker  exceed  the  threshold  value  of  0.3958.  Therefore,  the  conclusion  drawn  from 
this Tukey test is that speakers fdrd1 and fcmh0 had statistically significantly (at the
α = 0.05 level) higher mean quality scores than both mcmj0 and mchh0 for the
sentences and model orders chosen for this thesis.

4.2.4.2 Within-Treatment Analysis. Table 5 summarizes the mean
quality  score  differences  for  the  treatment  factor. 
Table 5 Pairwise differences of means for Treatment factor

                            Treatment
                    DE1      LPC1     LPC2     DE2
Treatment  Mean     2.5132   2.6623   2.7149   3.2193
DE1        2.5132   -        0.1491   0.2018   0.7061
LPC1       2.6623            -        0.0526   0.5570
LPC2       2.7149                     -        0.5044
DE2        3.2193

The threshold value is calculated as in (19) with the same values for $q_T$ and n
as for the speaker comparisons, but the error term $MS_{S/A}$ is 0.76 versus 0.84 in the
speaker comparisons. The resulting threshold value is $d_T$ = 0.3770.

Only  the  differences  in  means  between  the  two-phase  damped-exponential 
treatment  (DE2)  and  all  other  treatments  exceed  the  threshold  value  of  0.3770. 
Therefore, the conclusion drawn from this Tukey test is that the two-phase
damped-exponential synthesis method resulted in statistically significantly (at the
α = 0.05 level) higher mean quality scores than all other treatment methods used for the
sentences,  speakers,  and  model  orders  chosen  for  this  thesis. 
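
The two Tukey comparisons above can be reproduced from the tabulated values with a few lines of arithmetic. In this sketch the threshold is taken as $q_T\sqrt{MS_{error}/n}$, and the error mean square is recomputed from the SS and df columns of Table 3 rather than from the rounded MS column, since that is what reproduces the reported thresholds of 0.3958 and 0.3770. The function names are hypothetical.

```python
import math
from itertools import combinations

def tukey_threshold(q, ss_error, df_error, n):
    """Minimum pairwise mean difference: q * sqrt(MS_error / n)."""
    return q * math.sqrt((ss_error / df_error) / n)

def significant_pairs(means, threshold):
    """Pairs of levels whose mean difference exceeds the threshold."""
    return [(a, b) for a, b in combinations(means, 2)
            if abs(means[a] - means[b]) > threshold]

# Means from Tables 4 and 5; SS and df from Table 3.
speaker_means = {"mcmj0": 2.2237, "mchh0": 2.5877, "fdrd1": 2.9912, "fcmh0": 3.3070}
treatment_means = {"DE1": 2.5132, "LPC1": 2.6623, "LPC2": 2.7149, "DE2": 3.2193}

d_speaker = tukey_threshold(3.76, 45.48, 54, 76)     # error term Sp x S
d_treatment = tukey_threshold(3.76, 41.25, 54, 76)   # error term T x S

print(round(d_speaker, 4), significant_pairs(speaker_means, d_speaker))
print(round(d_treatment, 4), significant_pairs(treatment_means, d_treatment))
```

The speaker pairs that exceed the threshold are exactly the four female-male pairs, and the treatment pairs that exceed it are exactly the three that involve DE2.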

4.2.5 Non-Parametric Analysis Results. Analysis of the subjective listening
test data using the Friedman two-way analysis of variance supports the conclusion
that the two-phase damped-exponential model for speech synthesis performs better
than the other synthesis methods tested. The Friedman test showed that the two-phase
damped-exponential synthesis method was never outperformed by the other
methods tested. This conclusion lends credence to the conclusions drawn from the
two-factor within-subjects ANOVA and Tukey tests performed above.

4.3  Summary 

This  chapter  described  the  experiments  and  statistical  analysis  performed  to 
test  the  analysis-synthesis  systems  described  in  Chapter  III.  It  presented  the  subjec¬ 
tive  listening  test  performed  at  Armstrong  Laboratory  and  the  analysis  of  variance 
and Tukey tests used to draw conclusions of statistical significance from the data
collected. These tests indicate statistically significantly (at the α = 0.05 level) higher
mean quality scores for females across all synthesis methods. Qualities of the male
speakers such as higher pitch, softer voices, etc. may have contributed to this finding.
The tests also show that the two-phase damped-exponential model resulted in
statistically significantly (at the α = 0.05 level) higher mean quality scores across
speakers than all other synthesis methods used. These results were reinforced using the
non-parametric Friedman test. The next chapter will summarize these results and decide
whether or not the goal of this thesis was achieved.
V. Conclusions and Recommendations

As  mentioned  in  Chapter  I,  the  purpose  of  this  thesis  is  to  answer  the  questions: 

1. Does the use of a two-phase damped-exponential model in voiced-speech synthesis
show improvement in quality over that obtained with one- and two-phase
LPC or one-phase damped-exponential models?

2.  Does  the  use  of  a  two-phase  model  in  general  show  an  improvement  in  quality 
over  one-phase  models? 

This  chapter  uses  the  results  of  the  subjective  listening  test  and  statistical  analyses 
presented  in  Chapter  IV  to  determine  the  answers.  Additionally,  several  areas  in 
which  this  thesis  could  be  applied  for  further  study  are  discussed. 

5.1  Conclusions 

The ANOVA and Tukey tests of Chapter IV do show a statistically significant
(at the α = 0.05 level) increase in mean quality scores for the speech synthesized
using the two-phase damped-exponential model for voiced speech over the other
methods tested. However, since only four speakers, four treatment methods, and
three sentences were used, more thorough tests are recommended to more conclusively
address the primary question. The answer to the second question above is
that  the  potential  for  quality  improvement  of  a  two-phase  model  in  general  was  not 
shown  to  be  significant  in  this  particular  test.  Therefore,  the  conclusions  are: 

1. Yes, a two-phase damped-exponential model does show improvement in quality
over some other synthesis methods in use today for the limited data set
examined.  Further  testing  is  required. 

2.  No  significant  quality  improvement  could  be  shown  for  a  two-phase  model  in 
general  over  one-phase  models.  Again,  however,  limited  testing  was  done  and 
perhaps  further  testing  would  show  a  significant  improvement. 
5.2 Potential areas for further research

With  the  demonstrated  potential  for  improvement  in  quality  of  synthesis  using 
a  two-phase  damped-exponential  model  for  voiced  speech  comes  the  possibility  to 
improve  performance  in  a  wide  variety  of  speech  applications.  Before  covering  some 
of  these  applications,  it  must  be  noted  that  this  speech  model  is  inappropriate  for 
any  application  where  a  very  low  bit-rate  is  desired.  Each  increase  in  model  order  in 
this  method  results  in  the  addition  of  2  parameters  (complex  pole  and  amplitude), 
each represented by a real and imaginary number. In addition, since two sets of
parameters are estimated over each pitch period (analysis frame), there are twice
as many parameters per frame as in conventional methods. Thus, the number
of  values  required  to  represent  each  frame  of  speech  increases  by  8  for  every  increase 
in  model  order. 
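
The counting argument above can be written out as a small calculation. The function name is hypothetical, and the count follows the reasoning in the text as stated (it takes no credit for any redundancy between conjugate pole pairs).

```python
def values_per_frame(model_order, phases=2):
    """Real values needed per pitch period for the damped-exponential model."""
    params_per_order = 2   # one complex pole and one complex amplitude
    reals_per_param = 2    # real and imaginary parts
    return model_order * params_per_order * reals_per_param * phases

# Each unit increase in model order adds 8 values for the two-phase model.
increase = values_per_frame(19) - values_per_frame(18)
```

Under this counting, the model order of 18 per phase used in the listening test corresponds to 144 real values per pitch period for the two-phase model.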

5.2.1  Speaker  Identification.  Speaker  identification  systems  in  use  today 
incorporate a wide variety of classification techniques using numerous types of features.
Some features commonly extracted from the speech for use with the classifier
are mel frequency cepstral coefficients, linear prediction (LP) coefficients, LP cepstral
coefficients, and transitional coefficients [19,25]. Another set of parameters
worthy  of  examination  for  possible  classification  improvement  are  the  complex  poles 
and  amplitudes  extracted  in  the  two-phase  damped-exponential  analysis.  While  the 
increase  in  the  number  of  parameters  used  to  represent  a  frame  of  speech  is  detri¬ 
mental  in  very  low  bit-rate  coding  applications,  this  increase  could  be  beneficial  in  a 
speaker  identification  problem.  Whether  or  not  these  parameters  form  a  good  feature 
set  for  speaker  identification  is  not  known  leading  to  the  need  for  further  research. 
Furthermore,  the  discriminative  power  of  features  from  within  a  pitch  period  has  yet 
to  be  explored. 

5.2.2 Speech Synthesis. This thesis has shown the potential to improve
the quality of synthetic speech by using a two-phase damped-exponential model.
However, a significant improvement in quality was not deduced from the statistical
analysis for a two-phase model in general. Perhaps further testing would show that
statistically significant improvement or lead to improvements of the methods
presented here. In addition, many of the subjects on the testing panel mentioned the
presence of “noise” and other artifacts even though the fundamental speech quality
was very good. Research into possible ways to remove this noise and the remaining
artifacts may also improve the quality of synthetic speech. Another possible
improvement lies in determining the phase transition point. Only one
method was presented here. This method could be improved, or a new
method developed and used.

5.3  Thesis  Summary 

This work has presented evidence that the use of a two-phase damped-exponential
model for voiced speech synthesis provided statistically significant (at the $\alpha = 0.05$
level) improvement in the quality of synthetic speech relative to single-phase LPC,
single-phase damped-exponential, and two-phase LPC techniques for the data set
tested. Furthermore, this evidence was obtained using a new method for determining
the phase transition point without the use of EGG data. The results of the tests
accomplished are promising and show the need for further testing and research of this
model to strengthen the conclusions and lead to improvements for the algorithms
developed here. In addition, research into the application of this model to other
areas of speech technology may give promising results.

Appendix A. Synthesizing With a Damped-Exponential Model

This appendix details the analysis-synthesis procedure for a single-phase
damped-exponential model.

A.1 Synthesis Equation

The equation used to synthesize one segment of speech is:

$$s[n] = \sum_{i=0}^{R-1} A_i p_i^n, \qquad n = 0, \ldots, N-1 \tag{20}$$

where $N$ is the number of speech samples to synthesize, $R$ is the number of complex
poles estimated, $A_i$ is the complex amplitude for the $i$th pole, and $p_i$ is the
$i$th complex pole.

Since each of the parameters is a complex number in general, they can be
written as:

$$A_i = B_i e^{j\phi_i}, \text{ and} \tag{21}$$

$$p_i = C_i e^{j\omega_i} \tag{22}$$

where $B_i$, $C_i$, $\phi_i$, and $\omega_i$ are real numbers. Substituting (21) and (22) into (20)
yields:

$$s[n] = \sum_{i=0}^{R-1} B_i e^{j\phi_i} \left( C_i e^{j\omega_i} \right)^n \tag{23}$$

or

$$s[n] = \sum_{i=0}^{R-1} Z_i e^{j(\omega_i n + \phi_i)}, \tag{24}$$

where $Z_i = B_i C_i^n$. Given that the poles estimated are real or in complex conjugate
pairs, it becomes apparent from (24) that the synthetic waveform is simply a weighted
sum of sinusoids which are exponentially damped with the damping controlled by
the $C_i^n$ term.
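The claim that a conjugate pair of modes produces a real, exponentially damped sinusoid can be checked numerically. The following is a minimal sketch of the synthesis sum in (20), written in Python rather than the MATLAB used in Appendix B; the function name `synthesize` and the numeric values of the magnitude, phase, radius, and angle are hypothetical.

```python
import cmath
import math


def synthesize(amps, poles, n_samples):
    """Evaluate s[n] = sum_i A_i * p_i**n for n = 0 .. n_samples-1."""
    return [sum(a * p ** n for a, p in zip(amps, poles))
            for n in range(n_samples)]


# One hypothetical conjugate pair: amplitude magnitude B and phase phi,
# pole radius C and angle omega (radians per sample).
B, phi, C, omega = 2.0, 0.5, 0.95, 0.3
amps = [cmath.rect(B, phi), cmath.rect(B, -phi)]
poles = [cmath.rect(C, omega), cmath.rect(C, -omega)]

s = synthesize(amps, poles, 50)

# The pair sums to a real damped cosine: 2 * B * C**n * cos(omega*n + phi).
expected = [2 * B * C ** n * math.cos(omega * n + phi) for n in range(50)]

if __name__ == "__main__":
    print(max(abs(x.imag) for x in s))
    print(max(abs(x.real - e) for x, e in zip(s, expected)))
```

Both printed values should be at the level of floating-point rounding, which is the numerical counterpart of the statement that the waveform is a sum of damped sinusoids.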
A.2 Pole Estimation


The complex poles are estimated using linear prediction in the backward time
direction and singular value decomposition (SVD). The reason for the backward time
analysis is that the waveform over one pitch period, when the time indices are
reversed, appears to be unstable. Modes that are truly stable in the waveform
therefore appear unstable after the time reversal, and their poles have a
magnitude greater than one. Extraneous poles appear
inside the unit circle in the complex plane [17,18,24]. In practice, some poles may
be estimated as slightly unstable and appear slightly inside the unit circle before
reflection. Therefore, poles having a magnitude greater than a value selected by the
user (typically 0.87-0.9, determined empirically) are retained and reflected about the
unit circle. The next sections describe the estimation algorithm in more detail.

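A.2.1 Linear Prediction Equations. The equation to solve is:

$$\mathbf{A}\mathbf{x} = -\mathbf{h} \tag{25}$$

where for $L$ estimated poles and $N$ samples in the analysis frame $\mathbf{y}$:

$$\mathbf{x} = \begin{bmatrix} b(1) \\ b(2) \\ \vdots \\ b(L) \end{bmatrix} \tag{26}$$

and

$$[\mathbf{h} \,|\, \mathbf{A}] = \begin{bmatrix}
y_0 & y_1 & y_2 & \cdots & y_L \\
y_1 & y_2 & y_3 & \cdots & y_{L+1} \\
y_2 & y_3 & y_4 & \cdots & y_{L+2} \\
\vdots & \vdots & \vdots & & \vdots \\
y_{N-L-1} & y_{N-L} & y_{N-L+1} & \cdots & y_{N-1}
\end{bmatrix} \tag{27}$$

Equations (25-27) show that each data point is modeled as a weighted sum of
the future $L$ samples.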
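The structure of the backward prediction system can be made concrete with a small sketch that builds the first column and the remaining columns of the data matrix from a frame. This is an illustration only; the function name `backward_lp_system` is hypothetical, and the frame used is a single noiseless real exponential rather than speech.

```python
def backward_lp_system(y, L):
    """Build h and A so that each sample is predicted from the next L samples.

    Row i of the combined matrix is [y[i], y[i+1], ..., y[i+L]];
    h is its first column and A holds the remaining L columns.
    """
    N = len(y)
    rows = [list(y[i:i + L + 1]) for i in range(N - L)]
    h = [row[0] for row in rows]
    A = [row[1:] for row in rows]
    return h, A


# Hypothetical frame: y[n] = c**n with a stable pole at c = 0.9.
c = 0.9
y = [c ** n for n in range(6)]
h, A = backward_lp_system(y, L=1)

# With L = 1 every row gives the same coefficient: y[n] = -b(1) * y[n+1].
b1 = [-hi / row[0] for hi, row in zip(h, A)]

if __name__ == "__main__":
    print(b1)
```

For this frame every row yields b(1) = -1/c, so the root of the prediction polynomial lies at 1/c, outside the unit circle, which is the behavior the backward-time analysis relies on.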

The least squares solution of (25) for $\mathbf{x}$ is

$$\mathbf{x} = -\mathbf{A}^{\#}\mathbf{h} \tag{28}$$

where $\mathbf{A}^{\#}$ is the pseudoinverse of $\mathbf{A}$. This form of the equation
is the one we will use to estimate the poles in MATLAB®.
The vector $\mathbf{x}$ contains the coefficients for the prediction polynomial [18]

$$B(z) = 1 + b(1)z^{-1} + b(2)z^{-2} + \cdots + b(L)z^{-L} \tag{29}$$

from which the complex poles are determined.

A.2.2 Solving for x. Typically there are fewer true filter poles (M) than
the number estimated (L), (i.e. L > M). To alleviate ill-conditioning of A, the SVD
of  [h|A]  is  used  [18].  During  this  process,  small  singular  values  (<  10"®)  are  set  to 
zero  effectively  increasing  the  SNR  of  the  data  prior  to  solving  for  x. 

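Using the SVD, $\mathbf{A}$ can be represented as

$$\mathbf{A} = \mathbf{U}\mathbf{S}\mathbf{V}^H \tag{30}$$

where $\mathbf{U}$ is a $(N-L) \times (N-L)$ unitary matrix of the “left” singular vectors,
$\mathbf{S}$ is a $(N-L) \times L$ matrix of nonnegative real singular values, and $\mathbf{V}^H$ is the
hermitian (conjugate transpose) of the $L \times L$ matrix of “right” singular vectors [31].
As mentioned previously, the small real singular values in $\mathbf{S}$ are set to zero. Thus,
instead of using $\mathbf{S}$:

$$\mathbf{S} = \begin{bmatrix}
\sigma_1 & 0 & \cdots & 0 \\
0 & \sigma_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_L \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}_{(N-L) \times L}$$

we use

$$\boldsymbol{\Sigma} = \begin{bmatrix}
\sigma_1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & \sigma_2 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \sigma_R & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0 & 0 & \cdots & 0
\end{bmatrix}_{(N-L) \times L}$$

where $R < L$. Substitution into (25) yields:

$$\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H\mathbf{x} = -\mathbf{h}$$

Solving for $\mathbf{x}$ yields:

$$\mathbf{x} = -\mathbf{V}\boldsymbol{\Sigma}^{\#}\mathbf{U}^H\mathbf{h}$$

where

$$\boldsymbol{\Sigma}^{\#} = \begin{bmatrix}
\frac{1}{\sigma_1} & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & \frac{1}{\sigma_2} & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \frac{1}{\sigma_R} & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0 & 0 & \cdots & 0
\end{bmatrix}_{L \times (N-L)} \tag{36}$$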
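The effect of zeroing small singular values before inverting can be sketched for the diagonal factor alone. The function name `truncated_sigma_pinv`, the example singular values, and the tolerance used below are all hypothetical; the sketch only builds the pseudoinverse of the diagonal matrix and does not compute an SVD.

```python
def truncated_sigma_pinv(singular_values, n_rows, tol):
    """Pseudoinverse of an n_rows x L diagonal matrix of singular values.

    Singular values at or below `tol` are treated as zero, so their
    reciprocals are left at zero. The result has shape L x n_rows.
    Assumes n_rows >= L.
    """
    L = len(singular_values)
    pinv = [[0.0] * n_rows for _ in range(L)]
    for i, sigma in enumerate(singular_values):
        if sigma > tol:
            pinv[i][i] = 1.0 / sigma
    return pinv


if __name__ == "__main__":
    # Three singular values, the last one negligible (values are illustrative).
    for row in truncated_sigma_pinv([4.0, 0.5, 1e-9], n_rows=5, tol=1e-6):
        print(row)
```

Leaving the reciprocal of a negligible singular value at zero, rather than inverting it, is what keeps the noise-dominated directions from being amplified in the solution for x.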


A.2.3 p determination. Now that x is known (i.e. the coefficients of the
prediction polynomial), we can use MATLAB®’s “roots” function to determine the
complex vector of estimated true and extraneous poles, $\boldsymbol{\theta}$. The magnitude of each
complex pole is examined to identify those poles that are to be retained. If the
magnitude of the $i$th element of $\boldsymbol{\theta}$ is less than the threshold, the pole is rejected. If
the magnitude is greater than the user-selected threshold, the conjugate-reciprocal
($1/\theta_i^{*}$) is retained. This process reflects the pole about the unit circle in the z-plane.
The resultant vector, p, contains each of the estimated true poles for the vocal tract
filter.
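The selection and reflection step can be sketched as follows. The function name `select_and_reflect` is hypothetical, and the two example roots are chosen to resemble one retained and one rejected entry; this is not the thesis routine.

```python
import cmath
import math


def select_and_reflect(roots, radius):
    """Keep roots with magnitude above `radius`, reflected about the unit circle.

    Reflection uses the conjugate reciprocal, which inverts the magnitude
    and leaves the angle unchanged.
    """
    kept = []
    for r in roots:
        if abs(r) > radius:
            kept.append(1 / r.conjugate())
    return kept


# Two illustrative roots given as magnitude and angle in degrees.
roots = [
    cmath.rect(1.0088, math.radians(7.6475)),    # outside the threshold ring
    cmath.rect(0.7818, math.radians(124.0899)),  # inside the ring: rejected
]
poles = select_and_reflect(roots, radius=0.87)

if __name__ == "__main__":
    for p in poles:
        print(abs(p), math.degrees(cmath.phase(p)))
```

Only the first root survives, and its reflected magnitude is the reciprocal of the original magnitude at the same angle.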


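A.3 Determination of $A_i$’s in synthesis equation

Next we need to determine how much each pole contributes to the waveform.
The $A_i$’s in (20) represent these contributions. To estimate them, we use a matrix
form of (20) with the original data sequence $\mathbf{y}$ replacing the estimated sequence $s[n]$:

$$\mathbf{y} = \mathbf{A}\mathbf{a} \tag{37}$$

or

$$\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{N-1} \end{bmatrix} =
\begin{bmatrix}
1 & 1 & \cdots & 1 \\
p_0 & p_1 & \cdots & p_{R-1} \\
\vdots & \vdots & & \vdots \\
p_0^{N-1} & p_1^{N-1} & \cdots & p_{R-1}^{N-1}
\end{bmatrix}
\begin{bmatrix} A_0 \\ A_1 \\ \vdots \\ A_{R-1} \end{bmatrix}$$

Solving for $\mathbf{a}$ yields:

$$\mathbf{a} = \mathbf{A}^{\#}\mathbf{y} \tag{38}$$

which is easily solvable in MATLAB® using the “pinv” routine.

Now that the $A_i$’s and $p_i$’s from (20) have been determined, the new frame can
be synthesized.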
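The amplitude estimate can be illustrated without a pseudoinverse routine. The sketch below solves the normal equations by Gaussian elimination instead of calling “pinv”; the two give the same least squares answer only when the matrix of pole powers has full column rank, which is assumed here. All function names and the test signal are hypothetical.

```python
import cmath


def pole_power_matrix(poles, n_samples):
    """Matrix whose row n holds p_i**n for every pole p_i."""
    return [[p ** n for p in poles] for n in range(n_samples)]


def solve_linear(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x


def estimate_amplitudes(y, poles):
    """Least squares amplitudes from the normal equations (V^H V) a = V^H y."""
    V = pole_power_matrix(poles, len(y))
    R, N = len(poles), len(y)
    gram = [[sum(V[n][i].conjugate() * V[n][j] for n in range(N))
             for j in range(R)] for i in range(R)]
    rhs = [sum(V[n][i].conjugate() * y[n] for n in range(N)) for i in range(R)]
    return solve_linear(gram, rhs)


# Hypothetical noiseless frame built from one conjugate pair of modes.
true_poles = [cmath.rect(0.95, 0.3), cmath.rect(0.95, -0.3)]
true_amps = [cmath.rect(2.0, 0.5), cmath.rect(2.0, -0.5)]
frame = [sum(a * p ** n for a, p in zip(true_amps, true_poles)) for n in range(20)]
est_amps = estimate_amplitudes(frame, true_poles)

if __name__ == "__main__":
    print(est_amps)
```

On this noiseless frame the recovered amplitudes match the ones used to build it; with real speech the fit is only approximate and the residual is the synthesis error.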

A.4 An Example

As an example, let’s use the frame of data illustrated in Figure 11(a). This data
is one pitch period extracted from the phoneme /iy/ (as in “she”) in the sentence
“sa1” uttered by speaker “mefg0” in the TIMIT data base. The model parameters
are L = 18, N = 145, and threshold = 0.87.


Figure 11 (a) One pitch period from the phoneme /iy/ as in “she”, (b) Re-synthesized
waveform, (c) Plot of the error between the two.

Figure 12 (a): The estimated true and extraneous poles ($\boldsymbol{\theta}$), (b): The true poles
retained and reflected about the unit circle (p).


Using these parameters and data, the resulting $\boldsymbol{\theta}$ is as shown in Table 6.
Figure 12(a) shows the location of these poles on the z-plane with a solid ring designating
the 0.87 threshold. It can be seen from the magnitudes that only the 13 elements with
a magnitude greater than 0.87 will be retained. The remaining five will be rejected.

Reflecting the 13 retained poles about the unit circle yields the estimated pole
vector, p, shown in the second column of Table 6 and illustrated in Figure 12(b).

Using p and the original data vector, the complex amplitude vector, a, (shown
in the right hand column of Table 6) is estimated. Examination of the elemental
magnitudes of this vector illustrates an important feature of this analysis-synthesis
method. For this example, the significant modes (a mode being a pole/amplitude
pair) are those with complex magnitudes greater than or equal to 1.8441. These
modes correspond to the true modes as illustrated by the LPC spectra in Figure 13.
Those modes with smaller complex amplitudes correspond to a DC offset and an
extraneous mode; each of these modes would be rejected by extending the radius
threshold to 0.96. However, because they are kept, their impact on the synthesized
waveform is minimized by their small complex amplitudes in a. These modes are not
apparent in the LPC spectrum. Solving (20) results in the synthetic waveform shown
in Figure 11(b). The error between the two waveforms is illustrated in Figure 11(c).

Table 6 Estimated parameters. Frequency figures are based on a 16 kHz sample
rate. Entries are given as magnitude ∠ angle; rows with only the first column
filled are rejected poles.

θ                       p                       freq (Hz)   a
0.1513 ∠ 0.0000°
0.9513 ∠ 0.0000°        1.0512 ∠ 0.0000°        0           0.2244 ∠ 180.0000°
1.0088 ∠ 7.6475°        0.9912 ∠ 7.6475°        340         434.6135 ∠ 91.2425°
1.0088 ∠ -7.6475°       0.9912 ∠ -7.6475°       -340        434.6135 ∠ -91.2425°
1.0140 ∠ 43.1274°       0.9862 ∠ 43.1274°       1916.8      109.0635 ∠ 80.3428°
1.0140 ∠ -43.1274°      0.9862 ∠ -43.1274°      -1916.8     109.0635 ∠ -80.3428°
0.9834 ∠ 59.5625°       1.0169 ∠ 59.5625°       2647.2      1.8441 ∠ -123.4215°
0.9834 ∠ -59.5625°      1.0169 ∠ -59.5625°      -2647.2     1.8441 ∠ 123.4215°
0.9424 ∠ 79.7025°       1.0611 ∠ 79.7025°       3542.3      0.0073 ∠ 8.4519°
0.9424 ∠ -79.7025°      1.0611 ∠ -79.7025°      -3542.3     0.0073 ∠ -8.4519°
0.9969 ∠ 93.0193°       1.0031 ∠ 93.0193°       4134.2      8.1502 ∠ 153.4833°
0.9969 ∠ -93.0193°      1.0031 ∠ -93.0193°      -4134.2     8.1502 ∠ -153.4833°
0.7818 ∠ 124.0899°
0.7818 ∠ -124.0899°
1.0021 ∠ 144.9657°      0.9979 ∠ 144.9657°      6442.9      6.3134 ∠ 96.8941°
1.0021 ∠ -144.9657°     0.9979 ∠ -144.9657°     -6442.9     6.3134 ∠ -96.8941°
0.8689 ∠ 159.9182°
0.8689 ∠ -159.9182°
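The frequency figures in Table 6 follow directly from the pole angles and the 16 kHz sample rate. A minimal sketch of that conversion is given below; the function name `pole_frequency_hz` is hypothetical.

```python
import cmath
import math


def pole_frequency_hz(pole, fs):
    """Frequency of a mode from the angle of its pole.

    The pole angle is in radians per sample, so dividing by 2*pi gives
    cycles per sample and multiplying by fs gives cycles per second.
    """
    return cmath.phase(pole) / (2 * math.pi) * fs


if __name__ == "__main__":
    fs = 16000
    for mag, angle_deg in [(0.9912, 7.6475), (0.9862, 43.1274), (1.0611, -79.7025)]:
        pole = cmath.rect(mag, math.radians(angle_deg))
        print(angle_deg, pole_frequency_hz(pole, fs))
```

The pole magnitude plays no role in the frequency; it only sets how quickly the mode decays or grows across the frame.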


A.4.1 Pole Rejection Analysis. If we were to reject the poles at 3.542
kHz and DC, what would happen? Figure 14 shows (a) the original waveform, (b)
the reconstructed waveform including those modes, (c) the reconstructed waveform
excluding those modes, and (d) the error between the two synthetic waveforms.

Figure 15 shows the new LPC spectrum overlaid on the previous two. With
the absence of the 3.542 kHz mode, there is a sharper dip in the spectrum at this
point, causing a poor match of the spectrum. So even though the mode wasn’t
apparent in the spectrum containing all the poles, it was still an important component
from a spectral perspective. However, the error between the two synthetic waveforms
shown in Figure 14(d) is nearly zero over much of the frame. Since it is nearly
impossible to make any quality assessments on hearing a single pitch period of speech,
the impact of rejecting the mode cannot be empirically determined from a single
pitch period of speech through a listening test.

Figure 13 Plot of the LPC spectrum for the original and synthetic waveforms.

Figure 14 (a) Original waveform, (b) Re-synthesized waveform including DC and
3.542 kHz modes, (c) Plot of the reconstructed waveform excluding the
two modes, (d) The error between the two synthetic waveforms of (b)
and (c).

Figure 15 Plot of the LPC spectrum for the original and synthetic waveforms
including the spectrum without poles at DC and 3.542 kHz.




Appendix  B.  MATLAB®  Code  and  Support  Files 

For more information or to obtain copies of any or all code, e-mail: harb@afit.af.mil or
mdesimio@afit.af.mil.


B.1 Single-Phase Damped-Exponential Master Routine


% function [y,origpoles,savepoles,numv,numuv]=onephasedex(ifile,pfile, ...
%          phnfile,singord,uvord,fs,radius,maxper,uvpitch)
%
% Function: onephasedex.m
% Description: This function analyzes the incoming data ("data") using
%   exponentially damped sinusoids as the model for the speech waveform.
%   It performs sinusoid frequency (pole) analysis and complex damping
%   factor analysis on a pitch-synchronous level. The synthetic waveform
%   is the sum of the exponentially damped sinusoids.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified: 30 Jul 96 - Added comments, pre-emphasis for unvoiced frames,
%                       removed "energy" stuff.
%            2 Aug 96 - Added initial conditions for unvoiced frames.
%            6 Aug 96 - Modified code to read in binary data vice having it
%                       passed as an input parameter.
%           12 Aug 96 - Modified function header to return the original
%                       and saved poles as well as the number of voiced
%                       and unvoiced frames.
%
% Input parameters:
%   ifile:   The original speech signal data file name.
%   pfile:   The name of the binary data file containing the pitch marks.
%            Each pitch mark should be some non-zero number.
%   phnfile: The name of the hand-labelled phoneme file for the speaker and
%            sentence corresponding to "data".
%   singord: The order for the pole determination analysis to be performed
%            over the voiced portion of "data". This number corresponds
%            to the number of complex poles estimated. If this is an even
%            number, there will be singord/2 complex conjugate pairs of
%            poles.
%   uvord:   The order for the LPC analysis to be performed over the unvoiced
%            portion of "data".
%   fs:      The sampling frequency.
%   radius:  The minimum radius on the z-plane for accepting poles predicted
%            using the "backward" linear prediction method.
%   maxper:  The maximum allowable period for voiced speech. Used in
%            determining pitch marks when there are large gaps.
%   uvpitch: The desired frame rate during long periods (> maxper) of
%            unvoiced speech.
%
% Output Parameter:
%   y:         The synthetic waveform.
%   origpoles: A matrix whose columns contain the complex poles as estimated
%              using the backward prediction process for voiced frames.
%              This matrix contains all 'order' poles prior to selection/
%              reflection.
%   savepoles: A matrix whose columns contain the selected and reflected
%              complex poles estimated using the backward prediction process
%              for voiced frames.
%   numv:      The number of voiced frames.
%   numuv:     The number of unvoiced frames.
%
% Subroutines directly called:
%   read_dat.m
%   newpmrks.m
%   readphn.m
%   loadphndb.m
%   synthlpc.m
%   synthv.m
%   detvoice.m
%
% Subroutines indirectly called:
%   covlpc.m
%   getCoef.m
%   getAmps.m
%   reconst.m
%   diffeq.m
%

function [y,origpoles,savepoles,numv,numuv]=onephasedex(ifile,pfile,phnfile, ...
    singord,uvord,fs,radius,maxper,uvpitch)
%
% read in data and pitch marks
%
data=read_dat(ifile,'short');
pmarks=read_dat(pfile,'short');

% re-calculate pitch marks back to a zero-crossing
newmarks=newpmrks(data,pmarks,fs,maxper,uvpitch);
%
% initialize some variables
%
numframes=length(newmarks(:,1));
y=[];
maxord=max([singord;uvord]);
origpoles=zeros(maxord(1),1);
savepoles=origpoles;
numuv=0;
numv=0;
%
% read in phoneme file
%
[phnind,phnval]=readphn(phnfile);
phndb=loadphndb;
%
% Loop through each pitch period generated with epochs and "newpmarks"
%
for i=1:numframes,
  samples=data(newmarks(i,1):newmarks(i,2));
  %
  % identify which phoneme we're in
  %
  cur_phoneme=find((phnind(:,1)<=newmarks(i,1))&(phnind(:,2)>=newmarks(i,1)));
  %
  % Make crude voicing determination using phoneme file.
  %
  vuv=detvoice(phnval(cur_phoneme,1:4),phndb);
  if vuv
    numv=numv+1;
    %
    % Synthesize new frame.
    %
    [newframe,orig,sv]=synthv(samples,singord,radius);
    %
    % Save the poles originally estimated and those retained after
    % selection/reflection.
    %
    origpoles=[origpoles [orig;zeros(length(origpoles(:,1))-length(orig),1)]];
    savepoles=[savepoles [sv;zeros(length(savepoles(:,1))-length(sv),1)]];
    %
    % Add new frame to synthetic vector
    %
    y=[y;newframe];
  else
    %
    % Unvoiced portion of the code.
    %
    numuv=numuv+1;
    %
    % Synthesize unvoiced frame. There is pre-emphasis. Initial conditions are
    % assumed to be zero as it is less important for unvoiced segments.
    %
    % Get past p=singord values of synthesized speech for use in LPC analysis. If
    % we are on the first frame, assume initial conditions are zero.
    %
    if i==1,
      past=zeros(uvord,1);
    else
      past=y(length(y)-uvord+1:length(y));
    end;
    newframe=synthlpc(samples,uvord,past,0);
    %
    % Add on new frame
    %
    y=[y;newframe];
  end;
end; % for
save onephasedex

B.2 Two-Phase Damped-Exponential Master Routine

% function [y,phspnt,origpoles,savepoles,numv,numuv]=twophasedex(ifile,pfile, ...
%          phnfile,iniord,secord,uvord,fs,radius,maxper,uvpitch)
%
% Function: twophasedex.m
% Description: This function analyzes the incoming data ("data") using
%   exponentially damped sinusoids as the model for the speech waveform.
%   It performs sinusoid frequency (pole) analysis and complex damping factor
%   analysis on a pitch-synchronous level. However, each voiced pitch period
%   is subdivided into two frames. The transition point is determined by
%   estimating the model parameters over a short frame at the beginning of the
%   pitch period, a short frame at the end, and re-synthesizing the frame using
%   each set of model parameters. Using the error between the two synthetic
%   pitch periods and the original, the transition point is identified as the
%   point of intersection of the two error waveforms. The synthetic waveform
%   is the sum of the exponentially damped sinusoids characterized by model
%   parameters estimated over each new subframe.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified: 31 Jul 96 - Added comments, pre-emphasis for unvoiced frames,
%                       removed "energy" stuff.
%            2 Aug 96 - Added initial conditions for unvoiced LPC synth.
%            6 Aug 96 - Modified code to read in binary data vice having it
%                       passed as an input parameter.
%           12 Aug 96 - Modified function header to return the original
%                       and saved poles as well as the number of voiced
%                       and unvoiced frames.
%
% Input parameters:
%   ifile:   The original speech signal data file name.
%   pfile:   The name of the binary data file containing the pitch marks.
%            Each pitch mark should be some non-zero number.
%   phnfile: The name of the hand-labeled phoneme file for the speaker and
%            sentence corresponding to "data".
%   iniord:  The order for the pole determination analysis to be performed
%            over the initial frame of each voiced pitch period of "data".
%            This number corresponds to the number of complex poles
%            estimated. If this is an even number, there will be iniord/2
%            complex conjugate pairs of poles.
%   secord:  The order for the pole determination analysis to be performed
%            over the second frame of each voiced pitch period of "data".
%   uvord:   The order for the LPC analysis to be performed over the unvoiced
%            portion of "data".
%   fs:      The sampling frequency.
%   radius:  The minimum radius on the z-plane for accepting poles predicted
%            using the "backward" linear prediction method.
%   maxper:  The maximum allowable period for voiced speech. Used in
%            determining pitch marks when there are large gaps.
%   uvpitch: The desired frame rate during long periods (> maxper) of
%            unvoiced speech.
%
% Output Parameter:
%   y:         The synthetic waveform.
%   phspnt:    A vector containing each pitch marker denoted by a +1 and frame
%              transition points denoted by a -1.
%   origpoles: A matrix whose columns contain the complex poles as estimated
%              using the backward prediction process for voiced frames.
%              This matrix contains all 'order' poles prior to selection/
%              reflection.
%   savepoles: A matrix whose columns contain the selected and reflected
%              complex poles estimated using the backward prediction process
%              for voiced frames.
%   numv:      The number of voiced frames.
%   numuv:     The number of unvoiced frames.
%
% Subroutines directly called:
%   read_dat.m
%   newpmrks.m
%   readphn.m
%   loadphndb.m
%   synthlpc.m
%   synthv.m
%   detvoice.m
%   calctrans.m
%
% Subroutines indirectly called:
%   covlpc.m
%   getCoef.m
%   getAmps.m
%   reconst.m
%   calcerror.m
%   synth4tran.m
%   diffeq.m
%


function [y,phspnt,origpoles,savepoles,numv,numuv]=twophasedex(ifile,pfile, ...
    phnfile,iniord,secord,uvord,fs,radius,maxper,uvpitch)
%
% read in data and pitch marks
%
data=read_dat(ifile,'short');
pmarks=read_dat(pfile,'short');

% re-calculate pitch marks back to a zero-crossing
newmarks=newpmrks(data,pmarks,fs,maxper,uvpitch);
%
% initialize some variables
%
numframes=length(newmarks(:,1));
y=[];
maxord=max([iniord,secord,uvord]);
origpoles=zeros(maxord(1),1);
savepoles=origpoles;
numphase=[];
numuv=0;
numv=0;
num2phs=0;
%
% read in phoneme file
%
[phnind,phnval]=readphn(phnfile);
phndb=loadphndb;
phspnt=[];
tpoints=zeros(length(data),1);
%
% Loop through each pitch period generated with epochs and "newpmarks"
%
for i=1:numframes,
  samples=data(newmarks(i,1):newmarks(i,2));
  %
  % identify which phoneme we're in
  %
  cur_phoneme=find((phnind(:,1)<=newmarks(i,1))&(phnind(:,2)>=newmarks(i,1)));
  tpoints(newmarks(i,1))=1;
  %
  % Make crude voicing determination using phoneme file.
  %
  vuv=detvoice(phnval(cur_phoneme,1:4),phndb);
  if vuv
    %
    % Calculate Phase Transition Point
    %
    numv=numv+1;
    tranpoint=calctrans(samples,iniord,secord,radius);
    phspnt=[phspnt tranpoint];
    %
    % Synthesize from sample 1 to the transition point and then from the
    % transition point+1 to the end of the period.
    %
    num2phs=num2phs+1;
    tpoints(newmarks(i,1)+tranpoint-1)=-1;
    %
    % Check to see if the order is greater than the number of samples in the
    % initial frame. If so, move the transition point.
    %
    if tranpoint<=iniord,
      tranpoint=iniord+5;
    end;
    %
    % Synthesize the new initial frame and save poles.
    %
    [newframe,og,sv]=synthv(samples(1:tranpoint),iniord,radius);
    origpoles=[origpoles [og;zeros(length(origpoles(:,1))-length(og),1)]];
    savepoles=[savepoles [sv;zeros(length(savepoles(:,1))-length(sv),1)]];
    numphase=[numphase 2];
    %
    % Check to see if order is greater than the number of points in the second
    % frame. If so, move the transition point back.
    %
    if (length(samples)-tranpoint) <= secord,
      tranpoint=length(samples)-secord-5;
    end;
    %
    % Synthesize the second frame and save poles.
    %
    [temp,og,sv]=synthv(samples(tranpoint+1:length(samples)),secord,radius);
    origpoles=[origpoles [og;zeros(length(origpoles(:,1))-length(og),1)]];
    savepoles=[savepoles [sv;zeros(length(savepoles(:,1))-length(sv),1)]];
    numphase=[numphase 2];
    %
    % Check to see if there was a shift in transition point for the second frame.
    % An indication will be too many points in the resulting two subframes. If so,
    % overlap and average the first extra points of the second frame with the last
    % extra points of the first frame.
    %
    if (length(temp)+length(newframe)) > length(samples)
      diff=length(temp)+length(newframe)-length(samples);
      tmpnf=newframe(1:length(newframe)-diff);
      tmpnf=[tmpnf;(newframe(length(newframe)-diff+1:length(newframe)) ...
             +temp(1:diff))./2;temp(diff+1:length(temp))];
      newframe=tmpnf;
    else
      newframe=[newframe;temp];
    end;
    %
    % Add new pitch period onto synthetic vector.
    %
    y=[y;newframe];
  else
    %
    % Unvoiced portion of the code. Covariance method LPC analysis is performed
    % to determine vocal tract filter coefficients. These then define a filter
    % which is excited by unit energy white noise. No phase transition point
    % is calculated and a single phase is assumed.
    %
    numuv=numuv+1;
    if i==1,
      past=zeros(uvord,1);
    else
      past=y(length(y)-uvord+1:length(y));
    end;
    newframe=synthlpc(samples,uvord,past,0);
    numphase=[numphase 1];
    %
    % Add new frame to synthetic vector.
    %
    y=[y;newframe];
  end;
end; % for
save twophasedex
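The overlap-and-average step near the end of the voiced branch above can be isolated in a short sketch. It is written in Python for illustration only; the function name `join_subframes` is hypothetical, and the sketch assumes the intent is to keep the leading part of the first subframe, average the overlapping samples, and append the rest of the second subframe.

```python
def join_subframes(first, second, target_len):
    """Concatenate two synthesized subframes into one pitch period.

    If the subframes together hold more than `target_len` samples, the last
    `extra` samples of `first` are averaged with the first `extra` samples
    of `second`, so the result has exactly `target_len` samples.
    """
    extra = len(first) + len(second) - target_len
    if extra <= 0:
        return list(first) + list(second)
    head = list(first[:len(first) - extra])
    overlap = [(a + b) / 2
               for a, b in zip(first[len(first) - extra:], second[:extra])]
    return head + overlap + list(second[extra:])


if __name__ == "__main__":
    # Four samples plus three samples must fit a five-sample pitch period,
    # so two samples overlap and are averaged.
    print(join_subframes([1, 2, 3, 4], [6, 8, 9], target_len=5))
```

Averaging the overlapped samples avoids a hard discontinuity where the two subframes meet after the transition point has been moved.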

B.3 Single-Phase LPC Master Routine

% function y=onephaselpc(ifile,pfile,phnfile,singord,uvord,fs,maxper,uvpitch)
%
% Function: onephaselpc.m
% Description: This function analyzes the incoming data ("data") using
%   LPC models. It then uses the LPC coefficients and gain to re-synthesize
%   the waveform using the coefficients and gain to describe the reconstruction
%   filter, and drives it with an impulse for voiced speech, and unit variance,
%   zero mean white noise for unvoiced. This function performs
%   pitch-synchronous analysis-synthesis.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified: 30 Jul 96 - Added pre-emphasis for unvoiced frames.
%                     - Removed all "energy" tweaking.
%            2 Aug 96 - Removed pre-emphasis.
%            6 Aug 96 - Modified code to read in binary data vice having it
%                       passed as an input parameter.
%
% Input parameters:
%   ifile:   The original speech signal data file name.
%   pfile:   The name of the binary data file containing the pitch marks.
%            Each pitch mark should be some non-zero number.
%   phnfile: The name of the hand-labelled phoneme file for the speaker and
%            sentence corresponding to "data".
%   singord: The order for the LPC analysis to be performed over the voiced
%            portion of "data".
%   uvord:   The order for the LPC analysis to be performed over the unvoiced
%            portion of "data".
%   fs:      The sampling frequency.
%   maxper:  The maximum allowable period for voiced speech. Used in
%            determining pitch marks when there are large gaps.
%   uvpitch: The desired frame rate during long periods (> maxper) of
%            unvoiced speech.
%
% Output Parameter:
%   y: The synthetic waveform.
%
% Subroutines directly called:
%   read_dat.m
%   newpmrks.m
%   readphn.m
%   loadphndb.m
%   synthlpc.m
%   detvoice.m
%
% Subroutines indirectly called:
%   covlpc.m
%   diffeq.m

function y=onephaselpc(ifile,pfile,phnfile,singord,uvord,fs,maxper,uvpitch)
%
% read in pitch marks
%
data=read_dat(ifile,'short');
pmarks=read_dat(pfile,'short');

% re-calculate pitch marks back to a zero-crossing as well as break up
% long periods between marks identifying some unvoiced regions.
newmarks=newpmrks(data,pmarks,fs,maxper,uvpitch);
%
% initialize some variables
%
numframes=length(newmarks(:,1));
y=[];
numuv=0;
numv=0;
%
% read in phoneme file
%
[phnind,phnval]=readphn(phnfile);
phndb=loadphndb;
%
% Loop through each pitch period generated with epochs and "newpmarks"
%
for i=1:numframes,
  samples=data(newmarks(i,1):newmarks(i,2));
  %
  % identify which phoneme we're in
  %
  cur_phoneme=find((phnind(:,1)<=newmarks(i,1))&(phnind(:,2)>=newmarks(i,1)));
  %
  % Make crude voicing determination using phoneme file.
  %
  vuv=detvoice(phnval(cur_phoneme,1:4),phndb);
  if vuv
    numv=numv+1;
    %
    % Get past p=singord values of synthesized speech for use in LPC analysis. If
    % we are on the first frame, assume initial conditions are zero.
    %
    if i==1,
      past=zeros(singord,1);
    else
      past=y(length(y)-singord+1:length(y));
    end;
    %
    % Analyze and synthesize the frame
    %
    newframe=synthlpc(samples,singord,past,1);
    %
    % Add new frame onto synthesized waveform
    %
    y=[y;newframe];
  else
    %
    % Unvoiced portion of the code.
    %
    numuv=numuv+1;
    %
    % Synthesize unvoiced frame. There is pre-emphasis. Initial conditions are
    % assumed to be zero as it is less important for unvoiced segments.
    %
    % Get past p=singord values of synthesized speech for use in LPC analysis. If
    % we are on the first frame, assume initial conditions are zero.
    %
    if i==1,
      past=zeros(uvord,1);
    else
      past=y(length(y)-uvord+1:length(y));
    end;
    newframe=synthlpc(samples,uvord,past,0);
    %
    % Add on new frame
    %
    y=[y;newframe];
  end;
end; % for
save onephaselpc

B.4  Two-Phase  LPC  Master  Routine 

% function [y,phspnt]=twophaselpc(ifile,pfile,phnfile,iniord,secord,uvord,fs,...
%                                 radius,maxper,uvpitch)
%
% Function: twophaselpc.m
% Description: This function analyzes the incoming data ("data") using
%   LPC models. Each voiced pitch period is subdivided into two frames. The
%   transition point is determined by estimating the model parameters over a
%   short frame at the beginning of the pitch period, a short frame at the end,
%   and re-synthesizing the frame using each set of model parameters. Using
%   the error between the two synthetic pitch periods and the original, the
%   transition point is identified as the point of intersection of the two
%   error waveforms. It then uses the covariance method LPC to re-synthesize
%   the waveform using the coefficients and gain to describe the reconstruction
%   filter, and drives it with an impulse for voiced speech, and unit energy,
%   zero mean white noise for unvoiced. The synthetic waveform is computed over
%   each subframe for voiced speech.
%
% Author: Capt Al Arb, USAF
% Date: 2 Aug 96
% Modified: 6 Aug 96 - Modified code to read in binary data vice having it
%                      passed as an input parameter.
%
% Input parameters:
%   ifile:   The original speech signal data file name.
%   pfile:   The name of the binary data file containing the pitch marks.
%            Each pitch mark should be some non-zero number.
%   phnfile: The name of the hand-labeled phoneme file for the speaker and
%            sentence corresponding to "data".
%   iniord:  The order for the LPC analysis to be performed over the initial
%            phase of the voiced portion of "data".
%   secord:  The order for the LPC analysis to be performed over the second
%            phase of the voiced portion of "data".
%   uvord:   The order for the LPC analysis to be performed over the unvoiced
%            portion of "data".
%   fs:      The sampling frequency.
%   radius:  The minimum radius on the z-plane for accepting poles predicted
%            using the "backward" linear prediction method. Used in
%            transition point determination only.
%   maxper:  The maximum allowable period for voiced speech. Used in
%            determining pitch marks when there are large gaps.
%   uvpitch: The desired frame rate during long periods (> maxper) of
%            unvoiced speech.
%
% Output Parameter:
%   y:      The synthetic waveform.
%   phspnt: A vector containing each pitch marker denoted by a +1 and frame
%           transition points denoted by a -1.
%
% Subroutines directly called:
%   read_dat.m
%   newpmrks.m
%   readphn.m
%   loadphndb.m
%   synthlpc2.m
%   detvoice.m
%   calctrans.m
%
% Subroutines indirectly called:
%   covlpc.m
%   getCoef.m
%   getAmps.m
%   reconst.m
%   calcerror.m
%   synth4tran.m
%   diffeq.m
%

function [y,phspnt]=twophaselpc(ifile,pfile,phnfile,iniord,secord,uvord,fs,...
                                radius,maxper,uvpitch)

%
% read in data and pitch marks
%
data=read_dat(ifile,'short');
pmarks=read_dat(pfile,'short');

% re-calculate pitch marks back to a zero-crossing
newmarks=newpmrks(data,pmarks,fs,maxper,uvpitch);

%
% initialize some variables
%
numframes=length(newmarks(:,1));
y=[];
maxord=max([iniord,secord,uvord]);
numphase=[];
numuv=0;
numv=0;
num2phs=0;

%
% read in phoneme file
%
[phnind,phnval]=readphn(phnfile);
phndb=loadphndb;
phspnt=[];
tpoints=zeros(length(data),1);

%
% Loop through each pitch period generated with epochs and "newpmarks"
%
for i=1:numframes,
  samples=data(newmarks(i,1):newmarks(i,2));
  %
  % identify which phoneme we're in
  %
  cur_phoneme=find((phnind(:,1)<=newmarks(i,1))&(phnind(:,2)>=newmarks(i,1)));
  tpoints(newmarks(i,1))=1;
  %
  % Make crude voicing determination using phoneme file.
  %
  vuv=detvoice(phnval(cur_phoneme,1:4),phndb);
  if vuv
    %
    % Calculate Phase Transition Point
    %
    numv=numv+1;
    tranpoint=calctrans(samples,iniord,secord,radius);
    phspnt=[phspnt tranpoint];
    %
    % Synthesize from sample 1 to the transition point and then from the
    % transition point+1 to the end of the period.
    %
    num2phs=num2phs+1;
    tpoints(newmarks(i,1)+tranpoint-1)=-1;
    %
    % Check to see if the order is greater than the number of samples in the
    % initial frame. If so, move the transition point.
    %
    if tranpoint<=iniord,
      tranpoint=iniord+5;
    end;
    %
    % Get past p=iniord values of synthesized speech for use in LPC analysis. If
    % we are on the first frame, assume initial conditions are zero.
    %
    if i==1,
      past=zeros(iniord,1);
    else
      past=y(length(y)-iniord+1:length(y));
    end;
    %
    % Synthesize the new initial frame.
    %
    newframe=synthlpc2(samples(1:tranpoint),iniord,past,1,1);
    numphase=[numphase 2];
    %
    % Check to see if order is greater than the number of points in the second
    % frame. If so, move the transition point back.
    %
    if (length(samples)-tranpoint) <= secord,
      tranpoint=length(samples)-secord-5;
    end;
    %
    % Get past p=secord values of synthesized speech for use in LPC analysis.
    %
    past=newframe(length(newframe)-secord+1:length(newframe));
    %
    % Synthesize the second frame.
    %
    temp=synthlpc2(samples(tranpoint+1:length(samples)),secord,past,1,2);
    numphase=[numphase 2];
    %
    % Check to see if there was a shift in transition point for the second frame.
    % An indication will be too many points in the resulting two subframes. If so,
    % overlap and average the first extra points of the second frame with the last
    % extra points of the first frame.
    %
    if (length(temp)+length(newframe)) > length(samples)
      diff=length(temp)+length(newframe)-length(samples);
      % Keep the non-overlapping head of the first frame. (The listing as
      % printed indexes a single sample here, which would not give a frame
      % of length(samples) points.)
      tmpnf=newframe(1:length(newframe)-diff);
      tmpnf=[tmpnf;(newframe(length(newframe)-diff+1:length(newframe))...
             +temp(1:diff))./2;temp(diff+1:length(temp))];
      newframe=tmpnf;
    else
      newframe=[newframe;temp];
    end;
    %
    % Add new pitch period onto synthetic vector.
    %
    y=[y;newframe];
  else
    %
    % Unvoiced portion of the code. Covariance method LPC analysis is performed
    % to determine vocal tract filter coefficients. These then define a filter
    % which is excited by unit energy white noise. No phase transition point
    % is calculated and a single phase is assumed.
    %
    numuv=numuv+1;
    if i==1,
      past=zeros(uvord,1);
    else
      past=y(length(y)-uvord+1:length(y));
    end;
    newframe=synthlpc(samples,uvord,past,0);
    numphase=[numphase 1];
    %
    % Add new frame to synthetic vector.
    %
    y=[y;newframe];
  end;
end; % for
save twophaselpc
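
The overlap-and-average step applied when the transition point is moved back can be illustrated on its own. The following is a minimal sketch with made-up sample values; `first`, `second`, `total`, and `merged` are hypothetical names and are not variables from the routine above.

```matlab
% Hypothetical sub-frames that together are longer than the pitch period.
first  = (1:6)';          % stands in for the initial-phase frame
second = (10:10:50)';     % stands in for the second-phase frame
total  = 9;               % stands in for length(samples)

d = length(first) + length(second) - total;   % number of overlapping samples

% Keep the non-overlapping head, average the overlap, keep the tail.
merged = [first(1:end-d); ...
          (first(end-d+1:end) + second(1:d))/2; ...
          second(d+1:end)];
```

The merged frame has exactly `total` samples, so concatenating it onto the synthetic waveform keeps the output aligned with the original pitch marks.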


B.5 Converting Pitchmark File Into Frame Boundaries


% function newpfile=newpmrks(datafile,pitchfile,fs,maxperiod,uvpitch);
%
% Function: newpmrks.m
% Description: This function converts the single vector (pitchfile) containing
%              nonzero values where a closing instant was determined, into a
%              range identifying the starting point and ending point of each
%              new analysis frame/pitch period. If there is a long gap between
%              pitch marks, the gap is broken into smaller analysis frames.
%              The size of these frames is determined by the parameters uvpitch
%              and fs. The function will also move the starting point of each
%              frame to a point where the speech data has made a transition
%              through 0.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified:
%
% Input parameters:
%   datafile:  The original speech data.
%   pitchfile: The vector containing nonzero values where a closing instant
%              has been identified.
%   fs:        The sampling frequency.
%   maxperiod: The size of the period (in seconds) between pitch marks above
%              which the interval is divided into smaller analysis frames.
%   uvpitch:   The frame rate to be used to subdivide long periods of speech
%              with no pitch marks.
%
% Output Parameter:
%   newpfile: A (number of frames) x 2 matrix containing the starting point
%             and ending point of each analysis frame.
%
% Subroutines directly called:
%   none
%
% Subroutines indirectly called:
%   none
%

function newpfile=newpmrks(datafile,pitchfile,fs,maxperiod,uvpitch);

%
% Identify location of each pitchmark
%
pitchmarks=find(pitchfile);

%
% Initialize some variables
%
done=0;
lastmark=1;
nextmark=0;
count=1;
total=1;
newpfile=[];
uvinc=floor(fs/uvpitch);

%
% Keep looping through until we hit every pitchmark
%
while count <= length(pitchmarks),
  nextmark=pitchmarks(count);
  %
  % Check to see if we have already covered the next pitch mark. If not,
  % proceed. If so move on to next pitch mark
  %
  if nextmark > lastmark
    %
    % Check to see if there is a large gap between pitchmarks, typically
    % identifying unvoiced regions. If not, move pitchmark to a zero-crossing.
    % If so, divide into smaller analysis frames.
    %
    if (nextmark-lastmark <= fs*maxperiod)
      if datafile(nextmark) > 0
        zerocrossfound=0;
        while ~zerocrossfound
          nextmark=nextmark-1;
          if datafile(nextmark) <= 0
            zerocrossfound=1;
          end; % if datafile
        end; % while
      elseif datafile(nextmark) < 0
        zerocrossfound=0;
        while ~zerocrossfound
          nextmark=nextmark+1;
          if datafile(nextmark) >= 0
            zerocrossfound=1;
          end; % if datafile
        end; % while
      end; % if
      newpfile(total,:)=[lastmark nextmark];
      total=total+1;
      lastmark=nextmark+1;
      count=count+1;
    else
      %
      % Else divide into smaller analysis frames.
      %
      numsubframes=floor((nextmark-lastmark)/uvinc);
      onextmark=nextmark;
      nextmark=lastmark+uvinc;
      %
      % Loop through all but last subframe.
      %
      for k=1:numsubframes-1,
        newpfile(total,:)=[lastmark nextmark];
        total=total+1;
        lastmark=nextmark+1;
        nextmark=nextmark+uvinc;
      end; % for k
      newpfile(total,:)=[lastmark onextmark];
      total=total+1;
      lastmark=onextmark+1;
      count=count+1;
    end; % if (nextmark-lastmark <= fs*maxperiod)
  else
    count=count+1;
  end; % if nextmark > lastmark
end; % while

%
% Handle last interval (end of speech segment)
%
nextmark=length(datafile);
onextmark=nextmark;

%
% if small enough, keep as single analysis frame, otherwise divide as above
%
if (nextmark-lastmark <= fs*maxperiod)
  newpfile(total,:)=[lastmark nextmark];
else
  numsubframes=floor((nextmark-lastmark)/uvinc);
  nextmark=lastmark+uvinc;
  for k=1:numsubframes-1,
    newpfile(total,:)=[lastmark nextmark];
    total=total+1;
    lastmark=nextmark+1;
    nextmark=nextmark+uvinc;
  end; % for k
  newpfile(total,:)=[lastmark,onextmark];
  lastmark=onextmark+1;
  total=total+1;
  count=count+1;
end; % if (nextmark-lastmark <= fs*maxperiod)
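
The way a long gap is cut into fixed-rate frames can be traced on a small example. This is a sketch only: the values chosen for `fs`, `uvpitch`, and the gap end points are assumptions for illustration, and `frames` and `endmark` are hypothetical names.

```matlab
fs       = 16000;   % assumed sampling frequency (Hz)
uvpitch  = 100;     % assumed unvoiced frame rate (frames per second)
lastmark = 1;       % first sample of the gap
endmark  = 1000;    % last sample of the gap

uvinc  = floor(fs/uvpitch);                   % samples per sub-frame (160 here)
numsub = floor((endmark-lastmark)/uvinc);     % number of sub-frames (6 here)
frames = zeros(numsub,2);

% All but the last sub-frame advance by uvinc; the last one absorbs the remainder.
nextmark = lastmark+uvinc;
for k = 1:numsub-1
  frames(k,:) = [lastmark nextmark];
  lastmark = nextmark+1;
  nextmark = nextmark+uvinc;
end
frames(numsub,:) = [lastmark endmark];
```

Each row of `frames` starts one sample after the previous row ends, so the gap is covered without overlap; only the final frame has a different length.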

B.6  Reading  a  TIMIT  phoneme  label  file 
% function [a,b]=readphn(phnfile);
%
% Function: readphn.m
% Description: This function reads in the hand labelled TIMIT phoneme file and
%              returns a matrix of phoneme labels and a matrix of phoneme
%              start points and end points.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified:
%
% Input parameters:
%   phnfile: The name of the TIMIT phoneme file.
%
% Output Parameters:
%   a: A matrix containing the starting point and ending points of each
%      phoneme in columns 1 and 2 respectively. Each row is a different
%      phoneme.
%   b: A matrix of phoneme labels. Each row is a different phoneme.
%
% Subroutines directly called:
%   none
%
% Subroutines indirectly called:
%   none
%

function [a,b]=readphn(phnfile);

a=[];
b=[];

%
% Open TIMIT phoneme file for reading.
%
fid=fopen(phnfile,'r');

%
% Continue to read until reaching the end-of-file.
%
while ~feof(fid)
  %
  % Get one line of the file as a string.
  %
  s=fgetl(fid);
  %
  % Check to see if it's the end of file, if not, continue to process.
  %
  if s~=(-1)
    %
    % Break up string into 2 integers and a string.
    %
    p=sscanf(s,'%i %i %s');
    %
    % The two integers are the start point and end point of the phoneme.
    %
    a=[a;p(1) p(2)];
    %
    % Set the numerical version of the label string to an actual string.
    %
    p=setstr(p(3:length(p)))';
    %
    % Prepend the string with spaces to bring length of string to 4 with the
    % label right justified.
    %
    if length(p)==2
      p=['  ' p];
    elseif length(p)==3
      p=[' ' p];
    elseif length(p)==1
      p=['   ' p];
    end;
    %
    % add new label to b matrix
    %
    b=[b;p];
  end;
end;
fclose(fid);
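
The per-line parsing and right-justification can be tried on a single string without opening a file. The line below is a made-up example in the "start end label" layout; `bounds` and `label` are hypothetical names, and `char` is used here in place of the older `setstr`.

```matlab
s = '0 2260 h#';                 % made-up line: start sample, end sample, label

p = sscanf(s, '%d %d %s');       % numeric column: [start; end; character codes]
bounds = [p(1) p(2)];            % start and end sample of the phoneme
label  = char(p(3:end))';        % character codes back to a row string

% Right-justify the label in a field of width 4.
label = [repmat(' ', 1, 4-length(label)) label];
```

Padding with `repmat` covers label lengths 1 to 3 with one expression, which is one alternative to enumerating each length separately.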

B.7  Voicing  determination 

% function vuv=detvoice(curphon,phndb);
%
% Function: detvoice.m
% Description: This function determines whether the phoneme interval containing
%              the current frame of speech is voiced or unvoiced.
%
% Author: Capt Al Arb, USAF
% Date: 30 Jul 96
% Modified:
%
% Input parameters:
%   curphon: The phoneme label for the current frame of speech.
%   phndb:   The TIMIT phoneme data base matrix.
%
% Output Parameter:
%   vuv: voiced/unvoiced. 1=voiced, 0=unvoiced.
%
% Subroutines directly called:
%   none
%
% Subroutines indirectly called:
%   none
%

function vuv=detvoice(curphon,phndb);

vuv=0;
count=0;
done=0;

%
% Loop until we find the label
%
while ~done
  count=count+1;
  %
  % If the DB entry = curphon, we found it.
  %
  if phndb(count,1:4)==curphon
    done=1;
    phn=count;
  %
  % Or if we are at the end of the file, stop and assume unvoiced.
  %
  elseif count==length(phndb(:,1))
    done=1;
    phn=0;
  end;
end;
if phn
  %
  % If the category is VOICED, set vuv to 1.
  %
  if phndb(phn,5:10)=='VOICED'
    vuv=1;
  end;
end;
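
The table lookup can also be written without an explicit loop. The sketch below assumes a database layout of a 4-character right-justified label followed by a 6-character category, which is inferred from the column indices used above; the two rows shown are invented, and the matrix returned by loadphndb.m may differ.

```matlab
% Hypothetical database rows: columns 1-4 label, columns 5-10 category.
phndb = ['  aaVOICED'; ...
         '   sUNVOIC'];
curphon = '  aa';

% Compare the label columns of every row against the current label.
match = all(phndb(:,1:4) == repmat(curphon, size(phndb,1), 1), 2);
idx   = find(match, 1);

% Labels not found in the table are treated as unvoiced.
vuv = ~isempty(idx) && strcmp(phndb(idx,5:10), 'VOICED');
```

Using `strcmp` for the category test avoids an element-wise `==` between strings, which raises an error when the two strings differ in length.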

B.8  Determining  Transition  Point 

% function transpoint=calctrans(samples,iniord,secord,radius)
%
% Function: calctrans.m
% Description: This function estimates the transition point within
%              a voiced pitch period where the vocal tract model
%              parameters should be re-estimated. The transition point
%              is determined by estimating the model parameters over a
%              short frame at the beginning of the pitch period, a
%              short frame at the end, and re-synthesizing the frame
%              using each set of model parameters. Using the error
%              between the two synthetic pitch periods and the
%              original, the transition point is identified as the
%              point of intersection of the two error waveforms.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified: 31 Jul 96 - Added comments.
%
% Input parameters:
%   samples: The original pitch period of data.
%   iniord:  The analysis order to be used for the initial frame
%            parameter estimates.
%   secord:  The analysis order to be used for the second frame
%            parameter estimates.
%   radius:  The minimum radius on the z-plane for accepting poles
%            predicted using the "backward" linear prediction
%            method.
%
% Output Parameters:
%   transpoint: The point at which the vocal tract parameters should
%               be allowed to change.
%
% Subroutines directly called:
%   calcerror.m
%
% Subroutines indirectly called:
%   none
%

function transpoint=calctrans(samples,iniord,secord,radius)

rho=[];

%
% Set size of window over which the model is estimated to 1/3 the frame length.
%
wlengthi=floor(length(samples)/3);
wlengths=floor(length(samples)/3);

%
% If this length is shorter than the model order desired, we need to
% increase the number of samples used for the short window.
%
if wlengthi <= iniord
  wlengthi=min([0.8*length(samples),1.5*iniord]);
end;
if wlengths <= secord
  wlengths=min([0.8*length(samples),1.5*secord]);
end;

%
% Initialize "temp" as the first window. Prepare to estimate model over
% portion of initial phase and re-synthesize.
%
temp=samples(1:wlengthi);

%
% Compute Polynomial Coefficients
%
theta=getCoef(temp,iniord);
tooshort=0;

%
% Only try to pare down poles if "getCoef" generated at least one.
%
if (length(theta) > 1)
  %
  % Use "roots" to identify the actual poles corresponding to the
  % coefficients.
  %
  poles=roots(theta);
  numpoles=0;
  %
  % Look at each pole. If its radius in the Z-plane is >= radius,
  % we'll keep it. If not, reject it. Those poles outside the unit
  % circle are reflected in and those inside reflected out (1/conj()).
  %
  for j=1:length(poles)
    if (abs(poles(j)) >= radius)
      numpoles=numpoles+1;
      rho(numpoles)=1/conj(poles(j));
    end; % if
  end; % for j
  %
  % If no poles were saved, we have a problem. Decide to keep all outside
  % a radius of 0.5 without reflecting them (i.e. keep them stable).
  %
  if numpoles==0
    sv=find(abs(poles)>=0.5);
    rho=poles(sv);
  end;
  rho=rho(:);
  %
  % Compute Complex Amplitudes
  %
  A=getAmps(temp,rho);
  %
  % Generate synthesized frame.
  %
  newframe_closed=reconst(A,rho,length(samples));
else
  tooshort=1;
end;

%
% Repeat entire process for the second phase estimating the model over a small
% section at the end of the pitch period.
%
temp=samples(length(samples)-wlengths:length(samples));
theta=getCoef(temp,secord);
% Reset the pole list so that poles kept for the first phase are not
% carried over when fewer poles are kept for the second phase.
rho=[];
if (length(theta) > 1)
  poles=roots(theta);
  numpoles=0;
  for j=1:length(poles)
    if (abs(poles(j)) >= radius)
      numpoles=numpoles+1;
      rho(numpoles)=1/conj(poles(j));
    end; % if
  end; % for j
  if numpoles==0
    sv=find(abs(poles)>=0.5);
    rho=poles(sv);
  end;
  if length(rho)~=0
    rho=sort(rho(:));
    A=getAmps(temp,rho);
    newframe_open=synth4tran(A,rho,length(samples),length(temp));
  else
    tooshort=1;
  end;
else
  tooshort=1;
end;

%
% Calculate error between original and each synthesized frame and identify
% a transition point if one exists between 20 and 50% of the frame.
%
if ~tooshort
  transpoint=calcerror(samples,newframe_closed,newframe_open);
else
  transpoint=0;
end;
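
The pole selection and reflection rule used in both phases can be stated compactly. The sketch below uses two invented roots and an invented acceptance radius; `keep` is a hypothetical name.

```matlab
poles  = [1.25*exp(1i*pi/4); 0.3];   % made-up roots of a backward-prediction polynomial
radius = 1.0;                        % made-up acceptance radius

keep = abs(poles) >= radius;         % retain only roots at or beyond the radius
rho  = 1 ./ conj(poles(keep));       % reflect each retained root across the unit circle
```

Reflection through `1/conj(z)` inverts the magnitude and preserves the angle, so a root at radius 1.25 becomes a decaying pole at radius 0.8 with the same frequency.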

B.9 Synthesizing Damped-Exponential Frame

% function [synth,poles,rho]=synthv(samples,order,radius);
%
% Function: synthv.m
% Description: This function uses three subroutines as it analyzes a frame
%              of speech and re-synthesizes it using a sum of exponentially
%              damped sinusoids as the speech model.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified: 30 Jul 96 - Added comments.
%
% Input parameters:
%   samples: The original frame of speech.
%   order:   The order for the pole determination analysis to be performed
%            over the voiced portion of "data". This number corresponds
%            to the number of complex poles estimated. If this is an even
%            number, there will be singord/2 complex conjugate pairs of
%            poles.
%   radius:  The minimum radius on the z-plane for accepting poles
%            predicted using the "backward" linear prediction method.
%
% Output Parameters:
%   synth: The synthetic frame.
%   poles: The original complex poles estimated using the linear
%          prediction equations in the backward direction.
%   rho:   The complex poles retained, i.e. those with a magnitude (abs) >=
%          "radius". These poles have also been reflected inside/outside
%          the unit circle.
%
% Subroutines directly called:
%   getCoef.m
%   getAmps.m
%   reconst.m
%
% Subroutines indirectly called:
%   none
%

function [synth,poles,rho]=synthv(samples,order,radius);

%
% initialization
%
y=[];
rho=[];

%
% Compute Polynomial Coefficients
%
theta=getCoef(samples,order);

%
% Only try to pare down poles if "getCoef" generated at least one.
%
if (length(theta) > 1)
  %
  % Use "roots" to identify the actual poles corresponding to the coefficients.
  %
  poles=roots(theta);
  numpoles=0;
  %
  % Look at each pole. If its radius in the Z-plane is >= 'radius', we'll keep
  % it. If not, reject it. Those poles outside the unit circle are reflected in
  % and those inside reflected out (1/conj()).
  %
  for j=1:length(poles)
    if (abs(poles(j)) >= radius)
      numpoles=numpoles+1;
      rho(numpoles)=1/conj(poles(j));
    end; % if (...
  end; % for j
  %
  % If no poles were saved, we have a problem. Decide to keep all outside
  % a radius of 0.7 without reflecting them (i.e. keep them stable).
  %
  if numpoles==0;
    sv=find(abs(poles)>=0.7);
    rho=poles(sv);
  end;
  if length(rho)~=0
    %
    % Compute Complex Amplitudes
    %
    A=getAmps(samples,rho);
    %
    % Generate synthesized frame.
    %
    synth=reconst(A,rho,length(samples));
  end;
end;
rho=rho(:);
poles=poles(:);


B.10 LPC synthesis


% function synth=synthlpc(samples,order,past,vuv);
%
% Function Name: synthlpc
% Description: This routine performs LPC analysis-synthesis on a
%              frame of speech (samples) and returns a synthetic frame.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified: 30 Jul 96 - Changed to all Covariance method.
%           31 Jul 96 - Corrected energy normalization for unvoiced filter
%                       driving sequence.
%                     - Removed filter initial conditions (Zi) from voiced
%                       filter. Improved resulting synthesis.
%            1 Aug 96 - Implemented a straight difference equation for synthesis
%                       vice Matlab's filter command. Now use initial
%                       conditions for all LPC synthesis.
%                     - Added examination of poles for instability.
%            2 Aug 96 - Moved pole examination to covlpc.m
%
% Input Parameters:
%   samples: The original frame of speech.
%   order:   The analysis model order.
%   past:    The initial conditions, i.e. the last "order" number
%            of samples from the previous frame.
%   vuv:     Voiced/unvoiced flag: 0=unvoiced, nonzero=voiced.
%
% Output parameters:
%   synth: The synthetic frame.
%
% Subroutines directly called:
%   covlpc.m
%   diffeq.m
%
% Subroutines indirectly called:
%   none
%

function synth=synthlpc(samples,order,past,vuv);

%
% Check voiced/unvoiced flag
%
if vuv,
  %
  % get LPC coefficients and error using Covariance method LPC.
  %
  [a,E]=covlpc([past;samples],order);
  %
  % calculate gain term as the square root of the error.
  %
  G=sqrt(abs(E));
  %
  % Synthesize the new voiced frame using an impulse driven filter.
  %
  synth=diffeq(G,a,[1;zeros(length(samples)-1,1)],past);
else % unvoiced
  [a,E]=covlpc([past;samples],order);
  %
  % calculate gain term as the square root of the error.
  %
  G=sqrt(abs(E));
  %
  % Generate synthetic unvoiced frame using white noise driven filter.
  %
  rndseq=randn(length(samples),1);
  rndseq=rndseq./sqrt(sum(rndseq.^2));
  synth=diffeq(G,a,rndseq,past);
end;
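
The listing for diffeq.m is not part of this section, so the following is only a sketch of what an all-pole difference equation with initial conditions might look like. It assumes the coefficient vector has the form A(z) = 1 + a(2)z^-1 + ... + a(p+1)z^-p and that `past` holds the previous outputs oldest first; both conventions, and all numeric values, are assumptions rather than the behaviour of covlpc.m or diffeq.m.

```matlab
a    = [1; -0.9];          % assumed form A(z) = 1 + a(2)z^-1 (made-up value)
G    = 2;                  % made-up gain
N    = 5;                  % frame length
past = 0;                  % last p outputs of the previous frame, oldest first
p    = length(a) - 1;
x    = [1; zeros(N-1,1)];  % impulse excitation, as used for a voiced frame

% y(n) = G*x(n) - a(2)*y(n-1) - ... - a(p+1)*y(n-p)
buf = [past(:); zeros(N,1)];
for n = 1:N
  acc = G*x(n);
  for k = 1:p
    acc = acc - a(k+1)*buf(p+n-k);
  end
  buf(p+n) = acc;
end
synth = buf(p+1:end);
```

With zero initial conditions this reduces to the impulse response of G/A(z); carrying the tail of the previous frame in `past` instead lets the new frame continue from where the last one stopped.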


B.11 Determining Damped-Exponential Coefficients
% function theta=getCoef(data,order);
%
% Function: getCoef.m
% Description: This function estimates the denominator polynomial coefficients
%              for a causal system described by a set of linear equations
%              formed in the "backward" direction.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified: 30 Jul 96 - Added comments.
%
% Input parameters:
%   data:  The original frame of speech.
%   order: The order for the pole determination analysis to be performed
%          over the voiced portion of "data". This number corresponds to
%          the number of complex poles estimated. If this is an even
%          number, there will be singord/2 complex conjugate pairs of
%          poles.
%
% Output Parameters:
%   theta: The estimated polynomial coefficients.
%
% Subroutines directly called:
%   none
%
% Subroutines indirectly called:
%   none
%

function theta=getCoef(data,order);

%
% Initialize variables/matrices
%
data=data(:);
numsamples=length(data);

%
% Build Y matrix and b vector for backward prediction [b|A]theta=[0]
% A is the Singular Value Decomposition of Y with the diagonal elements
% of the S matrix inverted.
%
for i=1:numsamples-order,
  Y(i,:)=data(i+1:i+order)';
end;
b=-data(1:numsamples-order);

%
% Do SVD on Y
%
[U,S,V]=svd(Y);

%
% Matrix probably won't be square. Need to invert diagonal elements
% greater than a threshold (10^-9)
%
[numrows,numcols]=size(S);
R=min([numrows,numcols]);
for i=1:R,
  if S(i,i)>=10^(-9)
    S(i,i)=1/S(i,i);
  else
    S(i,i)=0;
  end;
end;

%
% Solve for theta. inv(A)=VSU'. (To be shown in appendix of thesis).
%
theta=V*S.'*(U')*b;

%
% Don't forget to tack on a 1 at the beginning!
%
theta=[1;theta];
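
A single real damped exponential makes it easy to check the backward-prediction step by hand. The sketch below rebuilds the same Y matrix and b vector for an invented signal and uses the built-in `pinv` with a tolerance in place of the explicit SVD loop; the signal, the order, and the names `z` and `rho` are assumptions chosen for illustration.

```matlab
data  = (0.9.^(0:9)).';              % made-up signal: one real damped exponential
order = 1;
N     = length(data);

% Backward prediction: data(i) + theta(1)*data(i+1) + ... = 0
Y = zeros(N-order, order);
for i = 1:N-order
  Y(i,:) = data(i+1:i+order)';
end
b = -data(1:N-order);

theta = [1; pinv(Y, 1e-9)*b];        % singular values below 1e-9 are treated as zero
z     = roots(theta);                % root of the backward polynomial
rho   = 1 ./ conj(z);                % reflected root
```

For this signal the backward polynomial has its root at 1/0.9, outside the unit circle, and reflecting it recovers the decay factor 0.9. That is why the calling routines apply `1/conj()` to the roots they keep.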

B.12 Determining Damped-Exponential Complex Amplitudes
% function a=getAmps(data,rho);
%
% Function: getAmps.m
% Description: This function estimates the complex amplitudes associated with
%              each complex pole previously estimated from the data.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified: 30 Jul 96 - Added comments.
%
% Input parameters:
%   data: The original frame of speech.
%   rho:  The complex poles previously estimated from the data.
%
% Output Parameters:
%   a: The estimated complex amplitudes.
%
% Subroutines directly called:
%   none
%
% Subroutines indirectly called:
%   none
%

function a=getAmps(data,rho);

%
% Initialize variables and matrices
%
data=data(:);
numsamples=length(data);
rho=rho(:);
R=ones(numsamples,length(rho));

%
% Build R matrix for equation Ra=d
%
for i=1:numsamples-1,
  R(i+1,:)=(rho').^i;
end;

%
% Solve for a
%
a=pinv(R)*data;

B.13 Damped-Exponential Re-synthesis Routine
% function Y=reconst(amp,rho,numsamples);
%
% Function: reconst.m
% Description: This function synthesizes the new frame of speech using the complex
%              poles and amplitudes previously estimated as the parameters for the
%              exponentially damped sinusoids.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified: 30 Jul 96 - Added comments.
%
% Input parameters:
%   amp:        The complex amplitudes previously estimated from the data.
%   rho:        The complex poles previously estimated from the data.
%   numsamples: The number of samples to synthesize.
%
% Output Parameters:
%   Y: The synthetic frame
%
% Subroutines directly called:
%   none
%
% Subroutines indirectly called:
%   none
%

function Y=reconst(amp,rho,numsamples);

%
% Initialization
%
amp=amp(:);
rho=rho(:);
R=ones(numsamples,length(rho));

%
% Build R matrix for summation.
%
% Y=R*amp where
%
%     |rho(0)^0    rho(1)^0    ...  rho(k-1)^0   |
% R = |rho(0)^1    rho(1)^1    ...  rho(k-1)^1   |
%     |   :           :                :         |
%     |rho(0)^n-1  rho(1)^n-1  ...  rho(k-1)^n-1 |
%
for i=0:numsamples-1,
  R(i+1,:)=(rho').^i;
end;

%
% Solve equation
%
Y=R*amp;
Y=real(Y);
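
For a single conjugate pole pair, the matrix product in reconst.m reduces to one exponentially damped cosine. The sketch below is illustrative only: the radius, angle, magnitude, and phase are assumed values, and the matrix is formed in one vectorized expression rather than with the loop used in the listing. Because the listing raises `rho'` (the conjugate transpose) to each power, the closed form has phase `w*n - phi`.

```matlab
% Hypothetical parameters (assumptions, not taken from the thesis).
r = 0.97; w = 0.25;              % pole radius and angle (rad/sample)
mag = 0.8; phi = 0.4;            % amplitude magnitude and phase (rad)
rho = [r*exp(1j*w); r*exp(-1j*w)];
amp = [mag*exp(1j*phi); mag*exp(-1j*phi)];
numsamples = 50;

% Same matrix as reconst.m, vectorized: row n+1 holds (rho').^n.
n = (0:numsamples-1)';
R = (ones(numsamples,1)*(rho')).^(n*ones(1,length(rho)));
Y = real(R*amp);

% Closed form for the pair: 2*|a|*r^n*cos(w*n - phi).
Yclosed = 2*mag*(r.^n).*cos(w*n - phi);
closedFormError = max(abs(Y - Yclosed));
```

Each conjugate pair therefore contributes one damped sinusoid whose decay rate is set by the pole radius and whose frequency is set by the pole angle.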


B.14 Determining Error in Transition Point Estimation
% function point = calcerror(original,closed,open);
%
% Function: calcerror.m
% Description: Using the error between the two synthetic pitch periods
%              and the original, the transition point is identified as the
%              point of intersection of the two error waveforms.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified: 31 Jul 96 - Added comments.
%
% Input parameters:
%   original: The original pitch period of data.
%   closed: The synthetic pitch period generated using parameters
%           estimated over the small window at the beginning of
%           the pitch period.
%   open: The synthetic pitch period generated using parameters
%         estimated over the small window at the end of
%         the pitch period.
%
% Output parameters:
%   point: The point at which the vocal tract parameters should
%          be allowed to change.
%
% Subroutines directly called:
%   none
%
% Subroutines indirectly called:
%   none
%

function point = calcerror(original,closed,open);

%
% Peak finding/curve fitting approach
%
% Connect the peaks of the error signal to get smooth error waveform.
%
clerror=((original-closed).^2)./sum((original-closed).^2);

%
% w is the number of points on each side of the center of the analysis
% window used for determining peaks.
%
w=2;
thresh=10^(-8)*max(abs(clerror));
newclerror=clerror;

%
% Loop through the error signal.
%
for j=w+1:length(clerror)-w;
  %
  % Find the peak in the analysis window.
  %
  [peak,peakbin]=max(abs(clerror(j-w:j+w)));
  peak=peak(1);
  peakbin=peakbin(1);
  if peak > thresh,
    %
    % Keep the max point and set all others to zero.
    %
    if peakbin==1,
      newclerror(j-w+1:j+w)=zeros(2*w,1);
    elseif peakbin == 2*w+1,
      newclerror(j-w:j+w-1)=zeros(2*w,1);
    else
      newclerror(j-w:j-w+peakbin-2)=zeros(peakbin-1,1);
      newclerror(j-w+peakbin:j+w)=zeros(2*w+1-peakbin,1);
    end;
  end;
end;

%
% Now all we have in the error vector is the peaks. Linearly
% interpolate lines between each one.
%
peaks=find(newclerror);
newclerror(1:peaks(1))=linspace(0,newclerror(peaks(1)),peaks(1));
if length(peaks)>2
  for k=2:length(peaks),
    newclerror(peaks(k-1):peaks(k))=linspace(newclerror(peaks(k-1)),...
        newclerror(peaks(k)),peaks(k)-peaks(k-1)+1);
  end;
else k=2;
end;
newclerror(peaks(k):length(newclerror))=linspace(newclerror(peaks(k)),...
    newclerror(length(clerror)),length(clerror)-peaks(k)+1);


%
% Repeat process for the other synthetic frame.
%
operror=((original-open).^2)./sum((original-open).^2);
w=2;
thresh=10^(-8)*max(abs(operror));
newoperror=operror;

for j=w+1:length(operror)-w;
  [peak,peakbin]=max(abs(operror(j-w:j+w)));
  peak=peak(1);
  peakbin=peakbin(1);
  if peak > thresh,
    if peakbin==1,
      newoperror(j-w+1:j+w)=zeros(2*w,1);
    elseif peakbin == 2*w+1,
      newoperror(j-w:j+w-1)=zeros(2*w,1);
    else
      newoperror(j-w:j-w+peakbin-2)=...
          zeros(peakbin-1,1);
      newoperror(j-w+peakbin:j+w)=...
          zeros(2*w+1-peakbin,1);
    end;
  end;
end;

peaks=find(newoperror);
newoperror(1:peaks(1))=linspace(0,newoperror(peaks(1)),peaks(1));
if length(peaks)>2
  for k=2:length(peaks),
    newoperror(peaks(k-1):peaks(k))=linspace(newoperror(peaks(k-1)),...
        newoperror(peaks(k)),peaks(k)-peaks(k-1)+1);
  end;
else k=2;
end;

if length(peaks)>=2,
  newoperror(peaks(k):length(newoperror))=linspace(newoperror(peaks(k)),...
      newoperror(length(newoperror)),length(operror)-peaks(k)+1);
elseif length(peaks)==1
  newoperror(1:peaks(1))=linspace(newoperror(1),newoperror(peaks(1)),peaks(1));
  newoperror(peaks(1)+1:length(newoperror))=linspace(newoperror(peaks(k)+1),...
      newoperror(length(newoperror)),length(newoperror)-peaks(1)+1);
end;

%
% Define the window over the entire pitch period that we will look for
% a crossing point. Define the window as 20%-50% of the period.
%
swin=ceil(0.2*length(clerror));
ewin=ceil(0.6*length(clerror));
k=swin;
if k < 20
  k=20;
end;

crossfound=0;
point=0;

while (~crossfound)&(k<=ewin)
  %
  % If the error from the period generated from the end-of-frame
  % parameters is less than that generated from the start-of-frame
  % parameters, we've found the point. Otherwise, keep searching.
  %
  if newoperror(k) <= newclerror(k)
    point=k;
    crossfound=1;
  else
    k=k+1;
  end;
end;

%
% If no point was found, define the transition point as the end of
% the search window.
%
if ~point
  point=ewin;
end;
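
The final crossing-point search in calcerror.m can be illustrated on two synthetic error envelopes. This is a sketch under stated assumptions: the envelopes are invented straight lines rather than smoothed errors from real speech, and `find` replaces the explicit `while` loop. The window bounds reuse the 0.2 and 0.6 factors and the minimum start index of 20 that appear in the listing.

```matlab
% Hypothetical smoothed error envelopes for a 100-sample pitch period.
N = 100;
n = (1:N)';
newclerror = n/1000;            % start-of-period model: error grows with time
newoperror = (101 - n)/1000;    % end-of-period model: error shrinks with time

% Search window as coded in calcerror.m, with the start clamped to 20.
swin = max(ceil(0.2*N), 20);
ewin = ceil(0.6*N);

% First index in the window where the end-of-period error is no larger.
idx = find(newoperror(swin:ewin) <= newclerror(swin:ewin), 1);
if isempty(idx)
  point = ewin;                 % fallback: end of the search window
else
  point = swin + idx - 1;
end
```

For these envelopes the two lines cross between samples 50 and 51, so the search should settle on sample 51, the first index at which the end-of-period model fits at least as well.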


B.15 Re-synthesizing in Transition Point Estimation

% function Y=synth4tran(amp,rho,numsamples,mlength);
%
% Function: synth4tran.m
% Description: This function synthesizes the new frame of speech using the
%              complex poles and amplitudes previously estimated as the
%              parameters for the exponentially damped sinusoids. However, it
%              assumes that the time=0 point is point numsamples-mlength+1 and
%              that the time indices are negative below this point.
%
% Author: Capt Al Arb, USAF
% Date: 29 Jul 96
% Modified: 31 Jul 96 - Added comments.
%
% Input parameters:
%   amp: The complex amplitudes previously estimated from the data.
%   rho: The complex poles previously estimated from the data.
%   numsamples: The number of samples to synthesize.
%   mlength: The length of the frame the model parameters were
%            estimated over.
%
% Output parameters:
%   Y: The synthetic frame
%
% Subroutines directly called:
%   none
%
% Subroutines indirectly called:
%   none
%

function Y=synth4tran(amp,rho,numsamples,mlength);

% Initialize vectors/matrices
amp=amp(:);
rho=rho(:);
R=ones(numsamples,length(rho));
spoint=mlength-numsamples+1;

% Build R matrix allowing for initial point to be other than 1
% and endpoint to be other than the end of the model frame.
for i=spoint:mlength,
  R(i+abs(spoint)+1,:)=(rho').^i;
end;

%
% Solve for Y
%
Y=R*amp;
Y=real(Y);
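
The shifted time axis used by synth4tran.m can be checked against ordinary forward synthesis. The sketch below uses assumed poles, amplitudes, and frame lengths, chosen so that `spoint` is negative (the row indexing in the listing relies on that). It extrapolates the model backward with negative exponents and then confirms that the samples at time indices 0 and above match a forward-only synthesis of the kind done in reconst.m.

```matlab
% Hypothetical parameters (assumptions, not taken from the thesis).
rho = [0.9*exp(1j*0.5); 0.9*exp(-1j*0.5)];
amp = [0.6+0.2j; 0.6-0.2j];
mlength = 10;                      % length of the analysis window
numsamples = 25;                   % length of the frame to synthesize
spoint = mlength - numsamples + 1; % first time index, here -14

% Same row layout as synth4tran.m: row 1 holds time index spoint.
R = ones(numsamples, length(rho));
for i = spoint:mlength
  R(i+abs(spoint)+1,:) = (rho').^i;
end
Y = real(R*amp);

% Forward-only synthesis over time indices 0..mlength.
Rfwd = ones(mlength+1, length(rho));
for i = 0:mlength
  Rfwd(i+1,:) = (rho').^i;
end
Yfwd = real(Rfwd*amp);

% The tail of Y (time indices >= 0) should match the forward synthesis.
tailError = max(abs(Y(abs(spoint)+1:end) - Yfwd));
```

Since the poles lie inside the unit circle, the samples at negative time indices grow in magnitude going backward, which is the expected behavior of a damped exponential run in reverse.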


B.16  Difference  Equation  Implementation  for  LPC  synthesis 
% function y=diffeq(B,A,X,ic);
%
% Function: diffeq.m
% Description: This function implements the difference equation:
%
%   y(n)=(B(1)*X(n)-A(2)*y(n-1)-A(3)*y(n-2)-...-A(p)*y(n-p+1))/A(1)
%
% where p is the number of coefficients in the denominator polynomial.
%
% Author: Capt Al Arb, USAF
% Date: 1 Aug 96
% Modified:
%
% Input parameters:
%   B: The numerator coefficient, typically the Gain for LPC synthesis.
%   A: The denominator polynomial coefficients.
%   X: The filter driving function.
%   ic: The initial conditions of the filter.
%
% Output parameter:
%   y: The filter output.
%
% Subroutines directly called:
%   none
%
% Subroutines indirectly called:
%   none
%


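function y=diffeq(B,A,X,ic);

% Force column vectors; the output buffer starts as the past outputs.
B=B(:);
A=A(:);
X=X(:);
ic=ic(:);
y=ic;

for n=length(ic)+1:length(ic)+length(X),
  y(n)=(B(1)*X(n-length(ic))-A(2:length(A))'*y(n-1:-1:n-length(A)+1))/A(1);
end;

% Drop the initial conditions so only the new samples are returned.
y=y(length(ic)+1:length(y));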
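
With zero initial conditions, the recursion in diffeq.m should agree with MATLAB's built-in `filter` for an all-pole filter. The following sketch uses an assumed second-order denominator, gain, and impulse excitation to make that comparison; none of these values come from the thesis.

```matlab
% Hypothetical second-order all-pole filter and excitation (assumptions).
A = [1; -1.2; 0.5];               % denominator, A(1) is the leading term
G = 0.7;                          % gain (numerator coefficient)
X = [1; zeros(19,1)];             % unit impulse, 20 samples
ic = zeros(length(A)-1, 1);       % zero initial conditions (past outputs)

% Direct recursion, following the loop in diffeq.m.
y = ic;
for n = length(ic)+1 : length(ic)+length(X)
  y(n) = (G*X(n-length(ic)) - A(2:end)'*y(n-1:-1:n-length(A)+1)) / A(1);
end
y = y(length(ic)+1:end);

% Reference: built-in filter with the same coefficients and zero state.
yref = filter(G, A, X);
filterError = max(abs(y - yref));
```

The listing differs from a plain `filter` call in that it seeds the recursion with the last samples of the previous frame, which carries the filter state across frame boundaries.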

B.17  Routine  for  Two-Phase  LPC  Synthesis 

% function synth=synthlpc2(samples,order,past,vuv,phase);
%
% Function Name: synthlpc2
% Description: This routine performs LPC analysis-synthesis on a
%              frame of speech (samples) and returns a synthetic frame.
%
% Author: Capt Al Arb, USAF
% Date: 2 Aug 96
% Modified:
%
% Input parameters:
%   samples: The original frame of speech.
%   order: The analysis model order.
%   past: The initial conditions, i.e. the last "order" number
%         of samples from the previous frame.
%   vuv: Voiced/unvoiced flag: 0=unvoiced, nonzero=voiced.
%   phase: 1=initial phase of voiced speech, 2=second phase.
%
% Output parameters:
%   synth: The synthetic frame.
%
% Subroutines directly called:
%   covlpc.m
%   diffeq.m
%
% Subroutines indirectly called:
%   none
%

function synth=synthlpc2(samples,order,past,vuv,phase);

%
% Check voiced/unvoiced flag
%
if vuv,
  %
  % Get LPC coefficients and error using Covariance method LPC.
  %
  [a,E]=covlpc([past;samples],order);
  %
  % Calculate gain term as the square root of the error.
  %
  G=sqrt(abs(E));
  %
  % Synthesize the new voiced frame using an impulse driven filter or no
  % input depending on phase.
  %
  if phase==1,
    synth=diffeq(G,a,[1;zeros(length(samples)-1,1)],past);
  else
    synth=diffeq(G,a,zeros(length(samples),1),past);
  end;

else % unvoiced
  [a,E]=covlpc([past;samples],order);
  %
  % Calculate gain term as the square root of the error.
  %
  G=sqrt(abs(E));
  %
  % Generate synthetic unvoiced frame using white noise driven filter.
  %
  rndseq=randn(length(samples),1);
  rndseq=rndseq./sqrt(sum(rndseq.^2));
  synth=diffeq(G,a,rndseq,past);
end;
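
The unvoiced branch of synthlpc2.m scales its noise excitation to unit energy, the same energy as the single unit impulse used in the first voiced phase. A short sketch of that normalization follows; the frame length of 160 samples is an assumed value.

```matlab
% Hypothetical frame length (assumption, not taken from the thesis).
N = 160;

% Unvoiced excitation: white Gaussian noise scaled to unit energy.
rndseq = randn(N,1);
rndseq = rndseq./sqrt(sum(rndseq.^2));
noiseEnergy = sum(rndseq.^2);

% Voiced first-phase excitation: a single unit impulse.
impulse = [1; zeros(N-1,1)];
impulseEnergy = sum(impulse.^2);
```

Because both excitations carry unit energy, the gain G computed from the prediction error scales voiced and unvoiced frames on the same basis.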